Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patatrack integration - HCAL local reconstruction (8/N) #31720

Merged

Conversation

fwyzard
Copy link
Contributor

@fwyzard fwyzard commented Oct 8, 2020

PR description:

Data formats and algorithms for the HCAL local reconstruction running on GPU.
Implements the HBHE unpacking and the production of HCAL uncalibrated rechits, including the MAHI algorithm.

PR validation:

Changes in use in the Patatrack releases.

if this PR is a backport please specify the original PR and why you need to backport that PR:

Includes changes from:

@cmsbuild
Copy link
Contributor

cmsbuild commented Oct 8, 2020

The code-checks are being triggered in jenkins.

@fwyzard
Copy link
Contributor Author

fwyzard commented Oct 8, 2020

For all questions, please address @mariadalfonso @vkhristenko .
For all changes, please make PRs against cms-patatrack:patatrack_integration_8_N_hcal_local_reco .

@cmsbuild
Copy link
Contributor

cmsbuild commented Oct 8, 2020

-code-checks

ERROR: Build errors found during clang-tidy run.

CUDADataFormats/HcalDigi/interface/DigiCollection.h:4:10: error: 'CUDADataFormats/CaloCommon/interface/Common.h' file not found [clang-diagnostic-error]
#include "CUDADataFormats/CaloCommon/interface/Common.h"
         ^
Suppressed 1522 warnings (1521 in non-user code, 1 with check filters).
--
CUDADataFormats/HcalDigi/interface/DigiCollection.h:4:10: error: 'CUDADataFormats/CaloCommon/interface/Common.h' file not found [clang-diagnostic-error]
#include "CUDADataFormats/CaloCommon/interface/Common.h"
         ^
Suppressed 794 warnings (793 in non-user code, 1 with check filters).
--
CUDADataFormats/HcalDigi/interface/DigiCollection.h:4:10: error: 'CUDADataFormats/CaloCommon/interface/Common.h' file not found [clang-diagnostic-error]
#include "CUDADataFormats/CaloCommon/interface/Common.h"
         ^
Suppressed 794 warnings (793 in non-user code, 1 with check filters).
--
CUDADataFormats/HcalDigi/interface/DigiCollection.h:4:10: error: 'CUDADataFormats/CaloCommon/interface/Common.h' file not found [clang-diagnostic-error]
#include "CUDADataFormats/CaloCommon/interface/Common.h"
         ^
Suppressed 1214 warnings (1213 in non-user code, 1 with check filters).
--
CUDADataFormats/HcalRecHitSoA/interface/RecHitCollection.h:6:10: error: 'CUDADataFormats/CaloCommon/interface/Common.h' file not found [clang-diagnostic-error]
#include "CUDADataFormats/CaloCommon/interface/Common.h"
         ^
Suppressed 793 warnings (792 in non-user code, 1 with check filters).
--
CUDADataFormats/HcalDigi/interface/DigiCollection.h:4:10: error: 'CUDADataFormats/CaloCommon/interface/Common.h' file not found [clang-diagnostic-error]
#include "CUDADataFormats/CaloCommon/interface/Common.h"
         ^
Suppressed 1269 warnings (1268 in non-user code, 1 with check filters).
--
gmake: *** [config/SCRAM/GMake/Makefile.coderules:128: code-checks] Error 2
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 2

@abdoulline
Copy link

I guess that for HCALxxxGPU conditions (.h and .cc) a natural place would be the same one as for regular conditions:
https://cmssdt.cern.ch/lxr/source/CondFormats/HcalObjects/
and not RecoLocalCalo/HcalRecAlgos ?

fwyzard added a commit to fwyzard/cms-bot that referenced this pull request Oct 10, 2020
See
[#31703](cms-sw/cmssw#31703) Patatrack integration - common data formats (5/N)
[#31704](cms-sw/cmssw#31704) Patatrack integration - calorimeters shared code (6/N)
[#31719](cms-sw/cmssw#31719) Patatrack integration - ECAL local reconstruction (7/N)
[#31720](cms-sw/cmssw#31720) Patatrack integration - HCAL local reconstruction (8/N)
[#31721](cms-sw/cmssw#31721) Patatrack integration - Pixel local reconstruction (9/N)
[#31722](cms-sw/cmssw#31722) Patatrack integration - Pixel track reconstruction (10/N)
[#31723](cms-sw/cmssw#31723) Patatrack integration - Pixel vertex reconstruction (11/N)
fwyzard added a commit to fwyzard/cms-bot that referenced this pull request Oct 10, 2020
See
  - cms-sw/cmssw#31703: Patatrack integration - common data formats (5/N)
  - cms-sw/cmssw#31704: Patatrack integration - calorimeters shared code (6/N)
  - cms-sw/cmssw#31719: Patatrack integration - ECAL local reconstruction (7/N)
  - cms-sw/cmssw#31720: Patatrack integration - HCAL local reconstruction (8/N)
  - cms-sw/cmssw#31721: Patatrack integration - Pixel local reconstruction (9/N)
  - cms-sw/cmssw#31722: Patatrack integration - Pixel track reconstruction (10/N)
  - cms-sw/cmssw#31723: Patatrack integration - Pixel vertex reconstruction (11/N)
Copy link
Contributor

@perrotta perrotta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please find below a few, mostly stlish, comments.
We expect to get also some cpu/gpu validation plot.
And we are waiting for the resolution of the build errors found during clang-tidy run to proceed with the automatic tests and the next steps in the review

}

template <>
constexpr uint32_t compute_nsamples<Flavor5>(uint32_t const nwords) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this specialization, since WORDS_PER_SAMPLE = 1 / SAMPLES_PER_WORD ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say this is to avoid a division of an integer by 0.5, which would involve floating point operations - while multiplying by 2 can be done with integer operations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant is: why cannot you use the previous templated compute_nsamples also for Flavor = Flavor5 ? (Then I agree with you that in all cases one could simply multiply by SAMPLES_PER_WORD instead of dividing by WORDS_PER_SAMPLE)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the other flavors were removed for which WORDS_PER_SAMPLE was 2 (flavor 2), which means you again have floating point in there... i just tried to keep only int operations in there...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@perrotta

why cannot you use the previous templated compute_nsamples also for Flavor = Flavor5 ?

The generic version is (lines 89-92)

  template <typename Flavor>
  constexpr uint32_t compute_nsamples(uint32_t const nwords) {
    return (nwords - Flavor::HEADER_WORDS) / Flavor::WORDS_PER_SAMPLE;
  }

For Flavor5 that would become

  constexpr uint32_t compute_nsamples(uint32_t const nwords) {
    return (nwords - 1) / 0.5;
  }

which does not look very sensible.

The Flavor5 specialisation instead becomes

  constexpr uint32_t compute_nsamples(uint32_t const nwords) {
    return (nwords - 1) * 2;
  }

which avoids floating point math for an integer multiplication by 2.

Then I agree with you that in all cases one could simply multiply by SAMPLES_PER_WORD instead of dividing by WORDS_PER_SAMPLE

I have no idea if - in principle - there can be cases where WORDS_PER_SAMPLE is greater than 1.
If that is not foreseen, then indeed we can change the generic version to

  template <typename Flavor>
  constexpr uint32_t compute_nsamples(uint32_t const nwords) {
    return (nwords - Flavor::HEADER_WORDS) * Flavor::SAMPLES_PER_WORD;
  }

If we want to keep this more generic, it could become

  template <typename Flavor>
  constexpr uint32_t compute_nsamples(uint32_t const nwords) {
    if constexpr(Flavor::SAMPLES_PER_WORD >= 1)
      return (nwords - Flavor::HEADER_WORDS) * Flavor::SAMPLES_PER_WORD;
    else
      return (nwords - Flavor::HEADER_WORDS) / Flavor::WORDS_PER_SAMPLE;
  }

CUDADataFormats/HcalDigi/interface/DigiCollection.h Outdated Show resolved Hide resolved

// get to the payload
auto const* payload64 = buffer + 2 + namc + amcoffset;
//amcoffset += amcSize;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this line is not needed, it can be removed. Otherwise just add another comment line which explains why do you want to keep here even if commented out

#ifdef HCAL_RAWDECODE_GPUDEBUG
// uhtr header v1 1st 64 bits
auto const payload64_w0 = payload64[0];
//uint32_t const data_length64 = payload64_w0 & 0xfffff;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this line is not needed, it can be removed. Otherwise just add another comment line which explains why do you want to keep here even if commented out


// FIXME: shift should be treated properly,
// here assume 8 time slices and 8 samples
auto const shift = 4 - soi; // as in cpu version!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this "4" be made a named constant?

if (npassive == 0)
break;

//s.head(npassive)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove commented out lines, if not needed; or alternatively add another comment line which explains why they must be kept here

auto const i_real = pulseOffsets(i);
solution(i_real) += alpha * (s(i) - solution(i_real));
}
//solution.head(npassive) += alpha *
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove commented out lines, if not needed; or alternatively add another comment line which explains why they must be kept here

auto const id = gch < nchannelsf01HE
? idsf01HE[gch]
: (gch < nchannelsf015 ? idsf5HB[gch - nchannelsf01HE] : idsf3HB[gch - nchannelsf015]);
//auto const id = gch >= nchannelsf01HE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove commented out lines, if not needed; or alternatively add another comment line which explains why they must be kept here

#pragma unroll
for (int i = 0; i < NPULSES; ++i)
pulseOffsets(i) = i;
// pulseOffsets(i) = pulseOffsetValues[i] - pulseOffsetValues[0];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove commented out lines, if not needed; or alternatively add another comment line which explains why they must be kept here

@fwyzard
Copy link
Contributor Author

fwyzard commented Oct 19, 2020

And we are waiting for the resolution of the build errors found during clang-tidy run to proceed with the automatic tests and the next steps in the review

That's a valid point, but it requires any one of the following:

Please lt us know which is/are the preferred option/s.

@perrotta
Copy link
Contributor

@fwyzard let wait for #31704 being merged, hoping it will happen soonish

@cmsbuild
Copy link
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

-code-checks

ERROR: Build errors found during clang-tidy run.

CUDADataFormats/HcalDigi/interface/DigiCollection.h:4:10: error: 'CUDADataFormats/CaloCommon/interface/Common.h' file not found [clang-diagnostic-error]
#include "CUDADataFormats/CaloCommon/interface/Common.h"
         ^
Suppressed 1476 warnings (1475 in non-user code, 1 with check filters).
--
CUDADataFormats/HcalDigi/interface/DigiCollection.h:4:10: error: 'CUDADataFormats/CaloCommon/interface/Common.h' file not found [clang-diagnostic-error]
#include "CUDADataFormats/CaloCommon/interface/Common.h"
         ^
Suppressed 716 warnings (715 in non-user code, 1 with check filters).
--
CUDADataFormats/HcalDigi/interface/DigiCollection.h:4:10: error: 'CUDADataFormats/CaloCommon/interface/Common.h' file not found [clang-diagnostic-error]
#include "CUDADataFormats/CaloCommon/interface/Common.h"
         ^
Suppressed 716 warnings (715 in non-user code, 1 with check filters).
--
CUDADataFormats/HcalDigi/interface/DigiCollection.h:4:10: error: 'CUDADataFormats/CaloCommon/interface/Common.h' file not found [clang-diagnostic-error]
#include "CUDADataFormats/CaloCommon/interface/Common.h"
         ^
Suppressed 1134 warnings (1133 in non-user code, 1 with check filters).
--
CUDADataFormats/HcalRecHitSoA/interface/RecHitCollection.h:6:10: error: 'CUDADataFormats/CaloCommon/interface/Common.h' file not found [clang-diagnostic-error]
#include "CUDADataFormats/CaloCommon/interface/Common.h"
         ^
Suppressed 715 warnings (714 in non-user code, 1 with check filters).
--
CUDADataFormats/HcalDigi/interface/DigiCollection.h:4:10: error: 'CUDADataFormats/CaloCommon/interface/Common.h' file not found [clang-diagnostic-error]
#include "CUDADataFormats/CaloCommon/interface/Common.h"
         ^
Suppressed 1189 warnings (1188 in non-user code, 1 with check filters).
--
gmake: *** [config/SCRAM/GMake/Makefile.coderules:128: code-checks] Error 2
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 2

@fwyzard
Copy link
Contributor Author

fwyzard commented Dec 3, 2020

The four differences are due to a different number of LogError messages:

plotting variable: edmErrorSummaryEntrys_logErrorHarvester__RECO.obj@.size()
edmErrorSummaryEntrys_logErrorHarvester__RECO.obj@.size() has 2 differences
plotting variable: edmErrorSummaryEntrys_logErrorHarvester__RECO.obj.count
edmErrorSummaryEntrys_logErrorHarvester__RECO.obj.count has 2 differences
plotting variable: edmErrorSummaryEntrys_logErrorHarvester__RECO.obj.module.size()
edmErrorSummaryEntrys_logErrorHarvester__RECO.obj.module.size() has 2 differences
plotting variable: edmErrorSummaryEntrys_logErrorHarvester__RECO.obj.category.size()
edmErrorSummaryEntrys_logErrorHarvester__RECO.obj.category.size() has 2 differences

@fwyzard
Copy link
Contributor Author

fwyzard commented Dec 3, 2020

For the 312.0 workflow, the only difference seem to be one more

+%MSG-w XrdAdaptor:  MixingModule:mix 03-Dec-2020 09:36:29 CET Run: 1 Event: 1
+Data is served from cern.ch instead of original site eoscms
+%MSG

in step 2, and one less

-%MSG-w XrdAdaptor:  MixingModule:mix 03-Dec-2020 04:14:48 CET Run: 1 Event: 1
-Data is served from cern.ch instead of original site eoscms
-%MSG

in step 3.

For the 1325.7 workflow, I see some actual differences in the logs.
In step 2 this part has changes:

@@ -6,20 +11,11 @@ Step: DQM Spec: ['@nanoAODDQM']
 customising the process with nanoAOD_customizeMC from PhysicsTools/NanoAOD/nano_cff
 Updating process to run DeepFlavour btag
 Will recalculate the following discriminators: pfDeepFlavourJetTags:probb, pfDeepFlavourJetTags:probbb, pfDeepFlavourJetTags:problepb, pfDeepFlavourJetTags:probc
-**************************************************************
-b tagging needs to be run on uncorrected jets. Hence, the JECs
-will first be undone for 'updatedPatJetsWithDeepInfo' and then applied to
-'updatedPatJetsTransientCorrectedWithDeepInfo'.
-**************************************************************
 Updating process to run DeepBoostedJet on datasets before 103X
 Updating process to run ParticleNet before it's included in MiniAOD
 Updating process to run DeepDoubleX on datasets before 104X
 Updating process to run DeepDoubleXv2 on datasets before 11X
-Will recalculate the following discriminators on AK8 jets: pfDeepBoostedJetTags:probTbcq, pfDeepBoostedJetTags:probTbqq, pfDeepBoostedJetTags:probTbc, pfDeepBoostedJetTags:probTbq, pfDeepBoostedJetTags:probWcq, pfDeepBoostedJetTags:probWqq, pfDeepBoostedJetTags:probZbb,
+Will recalculate the following discriminators on AK8 jets: pfDeepBoostedJetTags:probTbcq, pfDeepBoostedJetTags:probTbqq, 
-add DeepMET Producers
-customising the process with customiseWithTimeMemoryJobReport from Validation/Performance/TimeMemoryJobReport
-Starting  cmsRun -j JobReport2.xml  step2_NANO_DQM.py
-**************************************************************
pfDeepBoostedJetTags:probTbc, pfDeepBoostedJetTags:probTbq, pfDeepBoostedJetTags:probWcq, pfDeepBoostedJetTags:probWqq, pfDeepBoostedJetTags:probZbb,
 b tagging needs to be run on uncorrected jets. Hence, the JECs
 will first be undone for 'updatedPatJetsWithDeepInfo' and then applied to
 'updatedPatJetsTransientCorrectedWithDeepInfo'.

and in step 3

-HARVESTING:@nanoAODDQM
-entry file:step2_inDQM.root
-Step: HARVESTING Spec: ['@nanoAODDQM']
-customising the process with customiseWithTimeMemoryJobReport from Validation/Performance/TimeMemoryJobReport
-Starting  cmsRun -j JobReport3.xml  step3_HARVESTING.py

I doubt any of them is due to this PR, though.

@perrotta
Copy link
Contributor

perrotta commented Dec 3, 2020

I have run the comparison scripts in RecoLocalCalo/HcalRecProducers/test, and here are a few CPU-GPU comparison for the reconstructed quantities.

  • Mahi energy looks perfectly in agreement between the two:
    HcalEnergyMahi

  • On the other hand, M0 energies have a significantly different scale (lower in GPU), and they further show some different structure:
    HcalEnergyM0

Is there any comment about the M0 distributions?

@fwyzard
Copy link
Contributor Author

fwyzard commented Dec 3, 2020

I brought this up with @mariadalfonso some time ago, who told me (translated from italian)

that's because some parts have not been implemented yet; there are some TODO left in the code by Viktor

@vkhristenko
Copy link
Contributor

vkhristenko commented Dec 3, 2020 via email

@vkhristenko
Copy link
Contributor

vkhristenko commented Dec 3, 2020 via email

@mariadalfonso
Copy link
Contributor

@perrotta
correct: M0 has some left over TODO that I need to fix ( M0 it's not used in RECO, we mainly use as validation).

Once this coreHcalPR is merged: this is the plan

  1. finalize the M0
  2. implement an additional modification to follow up on recently noise correlation that are important in view of the Run3 aging see code updates to add a noise correlation factor between adjacent time slices for HB and HE channels #31064, HCAL MAHI code update to take only pedestal noise terms for correlation terms of a noise covariance matrix #31513, + another one that is about to come to complete it. I mentioned this in my talk at RECO meeting.
  3. extend the DQM to make the plots we make now by hands

The important point in my opinion, this coreHcalPR have reached all the needed ingredients to continue in cmssw w/o passing in cms-patatrack.

@perrotta
Copy link
Contributor

perrotta commented Dec 4, 2020

Thank you @fwyzard @vkhristenko @mariadalfonso for your investigation and comments

For what I can see, what's missing for M0 is just the insertion of a couple of scale factors: isn't it something basically ready to be easily implemented, and that could therefore already enter this PR?

Going back at the past presentation at the reco meeting cited by @mariadalfonso: was the issue of the 0 GPU energy finally solved, after that presentation?

I agree that all other points listed in #31720 (comment) can be postponed to a forthcoming update PR. Even the two items above can be postponed, if they are really too far from getting a solution now, although if they are ready to be implemented I don't see why not to integrate them here: please let us know.

@mariadalfonso
Copy link
Contributor

Going back at the past presentation at the reco meeting cited by @mariadalfonso: was the issue of the 0 GPU energy finally solved, after that presentation?

yes this has been fixed, see here
cms-patatrack#576 (comment)

@perrotta
Copy link
Contributor

perrotta commented Dec 7, 2020

@vkhristenko @mariadalfonso please let us know what do you intend to do about the missing scale factor(s) for the method0. That seems really a couple of multiplicative factors that are now missing, and which should be easy to get (if you don't have them already available): please let us know if there are issues I can't imagine that suggest to delay applying the fix to a follow up PR, or if we can implement it already here, instead.

@vkhristenko
Copy link
Contributor

@perrotta
We (me and @mariadalfonso) would prefer that things get merged first. regarding what's missing exactly and other ref see this patatrack pr (cms-patatrack#374). In short, we have to add containment correction. And although its application is a simple mult, to get this double, there is some logic, which also involves std::map.... which i personally never wanted to understand and which obviously has to be converted...

therefore, we would like to postpone and update this after things get merged, especially since it is affecting M0 only and not Mahi.

@abdoulline
Copy link

abdoulline commented Dec 7, 2020 via email

@perrotta
Copy link
Contributor

perrotta commented Dec 8, 2020

+1

  • Code changes have been approved by the patatrack team, as well as by the HCAL DPG
  • Method0 still needs to be finalized in the gpu version, but mahi looks correctly translated and reproduces well in both architectures
  • The gpu part of this PR got tested in cms-oc-gpu-01 by verifying that both the test scripts provided, and the dedicated HCal only workflows 136.888521 and 136.888522 can run without crashing on the gpu
  • Jenkins tests pass and show no differences in the "regular" workflows wrt the baseline

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 8, 2020

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@fwyzard
Copy link
Contributor Author

fwyzard commented Dec 8, 2020

I've created #32413 and added a link to cms-patatrack#539 to keep track of the open issues on the HCAL side.

@fwyzard
Copy link
Contributor Author

fwyzard commented Dec 8, 2020

The important point in my opinion, this coreHcalPR have reached all the needed ingredients to continue in cmssw w/o passing in cms-patatrack.

@mariadalfonso, since you seem so unhappy about the integration via cms-patatrack, I assume all further development will happen only in the master cms-sw repository.

What are the plans for keeping the performance of the GPU workflow under control during the development and validation of future PRs, in terms of both physics results and e.g. event processing throughput ?

@silviodonato
Copy link
Contributor

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants