Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DQM modules to monitor Pixel RecHits for heterogenous workflows #37174

Merged
merged 1 commit into from Mar 22, 2022

Conversation

sroychow
Copy link
Contributor

@sroychow sroychow commented Mar 8, 2022

PR description:

This PR introduces Rechi monitoring modules for pixels for CPU and GPU workflows. There is a base class where the histo booking, filling are done. Related talk

PR validation:

Checks done with 11634.501, 11634.506

if this PR is a backport please specify the original PR and why you need to backport that PR:

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 8, 2022

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-37174/28739

  • This PR adds an extra 20KB to repository

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 8, 2022

A new Pull Request was created by @sroychow (Suvankar Roy Chowdhury) for master.

It involves the following packages:

  • DQM/SiPixelPhase1Heterogeneous (dqm)

@emanueleusai, @ahmad3213, @cmsbuild, @jfernan2, @pmandrik, @pbo0, @rvenditti can you please review it and eventually sign? Thanks.
@hdelanno, @fioriNTU, @jandrea, @idebruyn, @mmusich, @threus this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@tsusa
Copy link
Contributor

tsusa commented Mar 8, 2022

please test

@sroychow
Copy link
Contributor Author

sroychow commented Mar 8, 2022

enable gpu

Copy link
Contributor

@mmusich mmusich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some quick preliminary comments.

}
}

void SiPixelPhase1MonitorRecHitsSoABase::fillHistosForRecHit(const DetId& id, const float xG, const float yG, const float zG, const float rG, const float fphi, const uint32_t charge, const int16_t sizeX, const int16_t sizeY) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of having a large set of arguments passed, wouldn't it be more elegant to store them in a POD struct or something to that effect?

@tsusa
Copy link
Contributor

tsusa commented Mar 8, 2022

urgent
needed for GPU/CPU comparison in 123X

@cmsbuild cmsbuild added the urgent label Mar 8, 2022
//
#ifndef DQM_SiPixelPhase1Heterogeneous_SiPixelPhase1MonitorRecHitsSoABase_h
#define DQM_SiPixelPhase1Heterogeneous_SiPixelPhase1MonitorRecHitsSoABase_h
#include "FWCore/Framework/interface/ESHandle.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed?

//
// Author: Alessandro Rossi, Suvankar Roy Chowdhury
//
#include "DataFormats/Math/interface/approx_atan2.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this needed for?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for phi conversion from short to float (short2phi) on line 55

@fwyzard
Copy link
Contributor

fwyzard commented Mar 8, 2022

assign heterogeneous

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 8, 2022

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

Copy link
Contributor

@mmusich mmusich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this other set is minor and I guess it could go into a follow-up.

static void fillDescriptions(edm::ConfigurationDescriptions& descriptions);

private:
edm::EDGetTokenT<TrackingRecHit2DCPU> tokenSoAHitsCPU_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be const (if then put into the initializer list)

Comment on lines 28 to 29
edm::EDGetTokenT<cms::cuda::Product<TrackingRecHit2DGPU>> tokenHitsGPU_;
edm::EDGetTokenT<SiPixelClusterCollectionNew> clusterToken_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
edm::EDGetTokenT<cms::cuda::Product<TrackingRecHit2DGPU>> tokenHitsGPU_;
edm::EDGetTokenT<SiPixelClusterCollectionNew> clusterToken_;
const edm::EDGetTokenT<cms::cuda::Product<TrackingRecHit2DGPU>> tokenHitsGPU_;
const edm::EDGetTokenT<SiPixelClusterCollectionNew> clusterToken_;

if they're put into the initializer's list at line 40-41.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also the convention for the member data is inconsistent (either use leading m_ or trailing _)

: geomToken_(esConsumes<TrackerGeometry, TrackerDigiGeometryRecord, edm::Transition::BeginRun>()),
topoToken_(esConsumes<TrackerTopology, TrackerTopologyRcd, edm::Transition::BeginRun>()) {
topFolderName_ = iConfig.getParameter<std::string>("TopFolderName"); //"SiPixelHeterogeneous/PixelRecHitsSoA";
onGPU_ = iConfig.getParameter<bool>("onGPU"); //unused now, but may be needed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer not to have a parameter without any effect, because setting it to True or False will be misleading :-(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead, better add it later, if it becomes needed.

@fwyzard
Copy link
Contributor

fwyzard commented Mar 8, 2022

@sroychow @arossi83 if you haven't already, could you check that the new module works correctly

  • in the 11634.501 workflow on a machine without GPUs (e.g. lxplus)
  • in the 11634.502 workflow on a machine without GPUs (e.g. lxplus)
  • in the 11634.502 workflow on a machine with GPUs (e.g. lxplus-gpu)

and that the plots look OK in the three cases ?

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 8, 2022

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5c87f9/22939/summary.html
COMMIT: 886edba
CMSSW: CMSSW_12_3_X_2022-03-08-1100/slc7_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/37174/22939/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found errors in the following unit tests:

---> test test-das-selected-lumis had ERRORS

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 6 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19811
  • DQMHistoTests: Total failures: 1928
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 17883
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 7479.093 KiB( 3 files compared)
  • DQMHistoSizes: changed ( 11634.506 ): 7479.093 KiB SiPixelHeterogeneous/PixelRecHitsSoA
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3985573
  • DQMHistoTests: Total failures: 15
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3985535
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.004 KiB( 48 files compared)
  • DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
  • Checked 204 log files, 45 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@@ -0,0 +1,3 @@
<export>
<lib name="1"/>
</export>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file shouldn't be needed.

static void fillPSetDescription(edm::ParameterSetDescription& desc);

protected:
edm::ParameterSet config_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appears to be unused

Suggested change
edm::ParameterSet config_;

const auto& rhsoaHandle = iEvent.getHandle(tokenSoAHitsCPU_);
if (!rhsoaHandle.isValid())
return;
auto const& rhsoa = iEvent.get(tokenSoAHitsCPU_); //*((rhsoaHandle.product())->get());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto const& rhsoa = iEvent.get(tokenSoAHitsCPU_); //*((rhsoaHandle.product())->get());
auto const& rhsoa = *rhsoaHandle;

would be cheaper and more idiomatic.

const auto& rhsoaHandle = iEvent.getHandle(tokenHitsGPU_);
if (!rhsoaHandle.isValid())
return;
const auto& rho = iEvent.get(tokenHitsGPU_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const auto& rho = iEvent.get(tokenHitsGPU_);
const auto& rho = *rhsoaHandle;

m_store32 = rhsoa.localCoordToHostAsync(ctx.stream());
m_hitsModuleStart = rhsoa.hitsModuleStartToHostAsync(ctx.stream());
auto xl = m_store32.get();
auto yl = xl + m_nHits;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no synchronization between these lines and where xl, yl, or m_hitsModuleStart are used.

The preferred solution would be to use ExternalWork, but apparently the DQMEDAnalyzer base class does not support additional extensions.

In general the next best solution would be to add an additional ExternalWork-EDProducer to do the transfer. (hopefully in the near future we'd find a way to avoid the explicit transfer-EDProducers, but it's possible that will be only for Alpaka-based code)

At minimum cudaCheck(cudaStreamSynchronize(ctx.stream())); would be needed (guarantees safety, but is not necessarily as efficient).

Comment on lines 30 to 31
cms::cuda::host::unique_ptr<float[]> m_store32;
cms::cuda::host::unique_ptr<uint32_t[]> m_hitsModuleStart;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These would be better as local variables (because they are used in only one member function).

@arossi83
Copy link
Contributor

arossi83 commented Mar 9, 2022

@sroychow @arossi83 if you haven't already, could you check that the new module works correctly

  • in the 11634.501 workflow on a machine without GPUs (e.g. lxplus)
  • in the 11634.502 workflow on a machine without GPUs (e.g. lxplus)
  • in the 11634.502 workflow on a machine with GPUs (e.g. lxplus-gpu)

and that the plots look OK in the three cases ?

@fwyzard the module works correctly in 11634.501 wf on lxplus and on 11634.502 wf on lxplus-gpu and plots are OK but it crashes on 11634.502 wf on lxplus.
We use the gpu procModifier to select which product to consume so in this "mixed" case (gpu procModifier but on a machine without a GPU) we try to access the CUDA RecHit anyway and this generates the crash.
According to this there is no a mechanism to avoid this actually. if you have any suggestion please let us know.

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-37174/28926

  • This PR adds an extra 16KB to repository

@fwyzard
Copy link
Contributor

fwyzard commented Mar 21, 2022

please test

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5c87f9/23258/summary.html
COMMIT: cf7abce
CMSSW: CMSSW_12_4_X_2022-03-21-1100/slc7_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/37174/23258/install.sh to create a dev area with all the needed externals and cmssw changes.

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 24 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19811
  • DQMHistoTests: Total failures: 2364
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 17447
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 8031.437 KiB( 3 files compared)
  • DQMHistoSizes: changed ( 11634.506 ): 8031.437 KiB SiPixelHeterogeneous/PixelRecHitsSoA
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

Comparison Summary

@slava77 comparisons for the following workflows were not done due to missing matrix map:

  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-5c87f9/39434.911_TTbar_14TeV+2026D88_DD4hep+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTrigger+RecoGlobal+HARVESTGlobal

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 2 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3695650
  • DQMHistoTests: Total failures: 51178
  • DQMHistoTests: Total nulls: 64
  • DQMHistoTests: Total successes: 3644386
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.004 KiB( 48 files compared)
  • DQMHistoSizes: changed ( 312.0 ): 0.004 KiB MessageLogger/Warnings
  • Checked 204 log files, 45 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@jfernan2
Copy link
Contributor

+1

@fwyzard
Copy link
Contributor

fwyzard commented Mar 22, 2022

+1

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Copy link
Contributor

+1

@fwyzard
Copy link
Contributor

fwyzard commented Oct 11, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants