Patatrack integration - ECAL local reconstruction (7/N) #31719

Merged

Conversation

fwyzard
Contributor

@fwyzard fwyzard commented Oct 8, 2020

PR description:

Data formats and algorithms for the ECAL local reconstruction running on GPU.
Implements the ECAL unpacking and the production of ECAL uncalibrated rechits, including the multifit algorithm

PR validation:

Changes in use in the Patatrack releases.

if this PR is a backport please specify the original PR and why you need to backport that PR:

Includes changes from:

@fwyzard
Contributor Author

fwyzard commented Oct 8, 2020

For all questions, please address @amassiro and @vkhristenko.
For all changes, please make PRs against cms-patatrack:patatrack_integration_7_N_ecal_local_reco.

@cmsbuild
Contributor

cmsbuild commented Oct 8, 2020

The code-checks are being triggered in jenkins.

@fwyzard
Contributor Author

fwyzard commented Oct 8, 2020

@cmsbuild, please test with #31703 #31704

@fwyzard
Contributor Author

fwyzard commented Oct 8, 2020

enable gpu

@cmsbuild
Contributor

cmsbuild commented Oct 8, 2020

-code-checks

ERROR: Build errors found during clang-tidy run.

CUDADataFormats/EcalRecHitSoA/interface/EcalUncalibratedRecHit.h:9:10: error: 'CUDADataFormats/CaloCommon/interface/Common.h' file not found [clang-diagnostic-error]
#include "CUDADataFormats/CaloCommon/interface/Common.h"
         ^
Suppressed 1530 warnings (1529 in non-user code, 1 with check filters).
--
CUDADataFormats/EcalRecHitSoA/interface/EcalRecHit.h:9:10: error: 'CUDADataFormats/CaloCommon/interface/Common.h' file not found [clang-diagnostic-error]
#include "CUDADataFormats/CaloCommon/interface/Common.h"
         ^
Suppressed 1604 warnings (1603 in non-user code, 1 with check filters).
--
CUDADataFormats/EcalDigi/interface/DigisCollection.h:4:10: error: 'CUDADataFormats/CaloCommon/interface/Common.h' file not found [clang-diagnostic-error]
#include "CUDADataFormats/CaloCommon/interface/Common.h"
         ^
Suppressed 909 warnings (908 in non-user code, 1 with check filters).
--
CUDADataFormats/EcalRecHitSoA/interface/EcalUncalibratedRecHit.h:9:10: error: 'CUDADataFormats/CaloCommon/interface/Common.h' file not found [clang-diagnostic-error]
#include "CUDADataFormats/CaloCommon/interface/Common.h"
         ^
Suppressed 1100 warnings (1099 in non-user code, 1 with check filters).
--
CUDADataFormats/EcalRecHitSoA/interface/EcalRecHit.h:9:10: error: 'CUDADataFormats/CaloCommon/interface/Common.h' file not found [clang-diagnostic-error]
#include "CUDADataFormats/CaloCommon/interface/Common.h"
         ^
Suppressed 820 warnings (819 in non-user code, 1 with check filters).
--
CUDADataFormats/EcalRecHitSoA/interface/EcalRecHit.h:9:10: error: 'CUDADataFormats/CaloCommon/interface/Common.h' file not found [clang-diagnostic-error]
#include "CUDADataFormats/CaloCommon/interface/Common.h"
         ^
Suppressed 1100 warnings (1099 in non-user code, 1 with check filters).
--
CUDADataFormats/EcalRecHitSoA/interface/EcalUncalibratedRecHit.h:9:10: error: 'CUDADataFormats/CaloCommon/interface/Common.h' file not found [clang-diagnostic-error]
#include "CUDADataFormats/CaloCommon/interface/Common.h"
         ^
Suppressed 800 warnings (799 in non-user code, 1 with check filters).
--
CUDADataFormats/EcalRecHitSoA/interface/EcalUncalibratedRecHit.h:9:10: error: 'CUDADataFormats/CaloCommon/interface/Common.h' file not found [clang-diagnostic-error]
#include "CUDADataFormats/CaloCommon/interface/Common.h"
         ^
Suppressed 954 warnings (953 in non-user code, 1 with check filters).
--
gmake: *** [config/SCRAM/GMake/Makefile.coderules:128: code-checks] Error 2
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 2

fwyzard added a commit to fwyzard/cms-bot that referenced this pull request Oct 10, 2020
See
[#31703](cms-sw/cmssw#31703) Patatrack integration - common data formats (5/N)
[#31704](cms-sw/cmssw#31704) Patatrack integration - calorimeters shared code (6/N)
[#31719](cms-sw/cmssw#31719) Patatrack integration - ECAL local reconstruction (7/N)
[#31720](cms-sw/cmssw#31720) Patatrack integration - HCAL local reconstruction (8/N)
[#31721](cms-sw/cmssw#31721) Patatrack integration - Pixel local reconstruction (9/N)
[#31722](cms-sw/cmssw#31722) Patatrack integration - Pixel track reconstruction (10/N)
[#31723](cms-sw/cmssw#31723) Patatrack integration - Pixel vertex reconstruction (11/N)
fwyzard added a commit to fwyzard/cms-bot that referenced this pull request Oct 10, 2020
See
  - cms-sw/cmssw#31703: Patatrack integration - common data formats (5/N)
  - cms-sw/cmssw#31704: Patatrack integration - calorimeters shared code (6/N)
  - cms-sw/cmssw#31719: Patatrack integration - ECAL local reconstruction (7/N)
  - cms-sw/cmssw#31720: Patatrack integration - HCAL local reconstruction (8/N)
  - cms-sw/cmssw#31721: Patatrack integration - Pixel local reconstruction (9/N)
  - cms-sw/cmssw#31722: Patatrack integration - Pixel track reconstruction (10/N)
  - cms-sw/cmssw#31723: Patatrack integration - Pixel vertex reconstruction (11/N)
@slava77
Contributor

slava77 commented Oct 12, 2020

@fwyzard
I guess we need at least the non-blocking parts to work.
Please clarify when and how this PR will get to the point of passing the tests.
Thank you.

@fwyzard
Contributor Author

fwyzard commented Oct 12, 2020

The PRs have an order for a reason, and this PR needs to be tested together with #31703 (now merged) and #31704.

@fwyzard
Contributor Author

fwyzard commented Oct 12, 2020

please test with #31704

@fwyzard
Contributor Author

fwyzard commented Oct 12, 2020

So... it looks like it's not possible to run the code checks together with another PR?

Any suggestions on how to proceed?

@jpata
Contributor

jpata commented Oct 14, 2020

PRs should be testable.

One option would be to improve the testing framework, but I'm not sure how much additional complexity it would take there to reliably test multiple PRs together.

Any other suggestion than progressing with #31704, so that this PR here can be tested automatically?

@fwyzard
Contributor Author

fwyzard commented Oct 14, 2020

> PRs should be testable.

See #27983 for a single fully testable PR.

Contributor

@jpata jpata left a comment

I'm about 30% through with reviewing the files in this PR.

The major item I would like to ask for is to put references to the original CPU implementation in the comments. This will make the review smoother, as well as assist future readers and developers.

CUDADataFormats/EcalRecHitSoA/interface/EcalRecHit.h (outdated, resolved)
static constexpr int kTowersInPhi = 4; // see EBDetId
static constexpr int kCrystalsInPhi = 20; // see EBDetId

static constexpr uint8_t MAX_DCCID = 54; //To be updated with correct and final number
Contributor

In general, for all such values and classes that are reimplemented & copied from the CPU version, I would like to ask to put a reference to the original CPU code in the GPU code comments, both to facilitate review right now and to allow future visitors to compare the two versions.

Contributor

This can be a few lines of comments in the top of the file, referencing the original CPU implementations.

Contributor Author

I would actually go further and request that all common values and types be shared between the two implementations.

However, this will likely require changes to the original CPU code to expose such values and types.
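
As a rough illustration of that suggestion, the sketch below shows one way a single header could expose the constants quoted above to both the CPU and the GPU code, with a comment pointing back to the CPU source. The namespace `ecal_shared_sketch`, the macro `SHARED_HOST_DEVICE` and the helper `isValidDccId` are hypothetical names introduced here, and the valid range `[1, MAX_DCCID]` is an assumption rather than something stated in this PR.

```cpp
#include <cassert>
#include <cstdint>

// Expands to the CUDA qualifiers only under nvcc, so the same header
// also compiles as plain C++ for the CPU implementation.
#ifdef __CUDACC__
#define SHARED_HOST_DEVICE __host__ __device__
#else
#define SHARED_HOST_DEVICE
#endif

// Hypothetical shared namespace. The values are the ones quoted in the
// review; the "see EBDetId" notes are the reference to the CPU code.
namespace ecal_shared_sketch {
  constexpr int kTowersInPhi = 4;     // see EBDetId
  constexpr int kCrystalsInPhi = 20;  // see EBDetId
  // Marked in the PR as "to be updated with correct and final number".
  constexpr std::uint8_t MAX_DCCID = 54;

  // Illustrative helper, usable on host and device.
  // ASSUMPTION: valid DCC ids are 1..MAX_DCCID inclusive.
  SHARED_HOST_DEVICE constexpr bool isValidDccId(int dccId) {
    return dccId >= 1 && dccId <= MAX_DCCID;
  }
}  // namespace ecal_shared_sketch
```

With a header of this kind, neither implementation would need its own copy of the numbers, at the cost of touching the CPU package to include it.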

EventFilter/EcalRawToDigi/interface/ElectronicsIdGPU.h (outdated, resolved)
}
}

__forceinline__ __device__ void print_first3bits(uint64_t const* buffer, uint32_t size) {
Contributor

as above, put them in a central place

Contributor Author

printing 3 specific bits out of an 8-byte word is definitely not something of central interest

auto* pChannelsCounter = isBarrel ? &pChannelsCounterEBEE[0] : &pChannelsCounterEBEE[1];

// FIXME: debugging
//printf("ifed = %u fed = %d offset = %u size = %u\n", ifed, fed, offset, size);
Contributor

remove all commented-out debugging lines here and elsewhere

sampleValues[7] = wdata2 & 0x3fff;
sampleValues[8] = (wdata2 >> 16) & 0x3fff;
sampleValues[9] = (wdata2 >> 32) & 0x3fff;
//printf("stripid = %u xtalid = %u\n", stripid, xtalid);
Contributor

remove
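
For readers unfamiliar with the masking in the quoted fragment, here is a minimal host-side sketch of the same pattern: pulling 14-bit values out of a 64-bit word with `0x3fff` at bit offsets 0, 16 and 32. The function names `extractSample` and `unpackThreeSamples` are hypothetical, and the layout (14-bit samples in 16-bit slots) is only inferred from the three lines shown in the diff, not taken from the unpacker itself.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// 0x3fff keeps the lowest 14 bits, as in the quoted fragment.
constexpr std::uint64_t kSampleMask = 0x3fff;

// Extract the 14-bit field that starts at bit `shift` of `word`.
constexpr std::uint16_t extractSample(std::uint64_t word, unsigned shift) {
  return static_cast<std::uint16_t>((word >> shift) & kSampleMask);
}

// Unpack the three fields at bit offsets 0, 16 and 32.
// ASSUMPTION: this mirrors only the three lines visible in the diff.
constexpr std::array<std::uint16_t, 3> unpackThreeSamples(std::uint64_t word) {
  return {extractSample(word, 0), extractSample(word, 16), extractSample(word, 32)};
}

// Compile-time check: the two bits above the 14-bit field are masked off.
static_assert(extractSample(0xffffULL, 0) == 0x3fff,
              "upper two bits of the 16-bit slot must be discarded");
```

Because the helpers are `constexpr`, the bit layout can be checked at compile time, which avoids the kind of `printf` debugging the review asks to remove.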

@cmsbuild
Contributor

The code-checks are being triggered in jenkins.

@fwyzard
Contributor Author

fwyzard commented Oct 15, 2020

The include guards and the related comments should be fixed by the latest commit.

@cmsbuild
Contributor

-1

Failed Tests: RelVals-INPUT
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a29982/12297/summary.html
COMMIT: 92f3342
CMSSW: CMSSW_11_3_X_2021-01-14-2300/slc7_amd64_gcc900
Additional Tests: GPU

RelVals-INPUT

250202.172_TTbar_13UP17+TTbar_13UP17INPUT+DIGIPRMXUP17_PU25_RD+RECOPRMXUP17_PU25+HARVESTUP17_PU25/step2_TTbar_13UP17+TTbar_13UP17INPUT+DIGIPRMXUP17_PU25_RD+RECOPRMXUP17_PU25+HARVESTUP17_PU25.log
250202.17_TTbar_13UP17+TTbar_13UP17INPUT+DIGIPRMXUP17_PU25+RECOPRMXUP17_PU25+HARVESTUP17_PU25/step2_TTbar_13UP17+TTbar_13UP17INPUT+DIGIPRMXUP17_PU25+RECOPRMXUP17_PU25+HARVESTUP17_PU25.log
250208.17_SMS-T1tttt_mGl-1500_mLSP-100_13UP17+SMS-T1tttt_mGl-1500_mLSP-100_13UP17INPUT+DIGIPRMXUP17_PU25+RECOPRMXUP17_PU25+HARVESTUP17_PU25/step2_SMS-T1tttt_mGl-1500_mLSP-100_13UP17+SMS-T1tttt_mGl-1500_mLSP-100_13UP17INPUT+DIGIPRMXUP17_PU25+RECOPRMXUP17_PU25+HARVESTUP17_PU25.log

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 15 differences found in the comparisons
  • DQMHistoTests: Total files compared: 37
  • DQMHistoTests: Total histograms compared: 2716961
  • DQMHistoTests: Total failures: 18
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 2716920
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.004 KiB( 36 files compared)
  • DQMHistoSizes: changed ( 312.0 ): 0.004 KiB MessageLogger/Warnings
  • Checked 156 log files, 37 edm output root files, 37 DQM output files

@fwyzard
Contributor Author

fwyzard commented Jan 15, 2021

The failures are due to

%MSG-w XrdAdaptorInternal:  PreMixingModule:mixData 15-Jan-2021 11:32:18 CET  Run: 1 Event: 5604
Failed to open file at URL root://eoscms.cern.ch:1094//eos/cms/store/user/cmsbuild/store/relval/CMSSW_10_6_0/RelValPREMIXUP17_PU25/PREMIX/PU25ns_106X_mc2017_realistic_v3-v1/10000/0847C39A-8322-D144-8B5C-172ECD86C674.root?xrdcl.requuid=02658723-6392-43a8-90c0-b2f86a77e313.
%MSG
%MSG-w XrdAdaptorInternal:  PreMixingModule:mixData 15-Jan-2021 11:32:18 CET  Run: 1 Event: 5604
Failed to open file at URL root://eoscms.cern.ch:1094//eos/cms/store/user/cmsbuild/store/relval/CMSSW_10_6_0/RelValPREMIXUP17_PU25/PREMIX/PU25ns_106X_mc2017_realistic_v3-v1/10000/0847C39A-8322-D144-8B5C-172ECD86C674.root?tried=&xrdcl.requuid=24bd7721-bc7d-434c-806b-f2f10a542541.
%MSG
15-Jan-2021 11:32:18 CET  Initiating request to open file root://cms-xrd-global.cern.ch//store/relval/CMSSW_10_6_0/RelValPREMIXUP17_PU25/PREMIX/PU25ns_106X_mc2017_realistic_v3-v1/10000/0847C39A-8322-D144-8B5C-172ECD86C674.root
%MSG-w XrdAdaptorInternal:  PreMixingModule:mixData 15-Jan-2021 11:33:19 CET  Run: 1 Event: 5604
Failed to open file at URL root://cms-xrd-global.cern.ch:1094//store/relval/CMSSW_10_6_0/RelValPREMIXUP17_PU25/PREMIX/PU25ns_106X_mc2017_realistic_v3-v1/10000/0847C39A-8322-D144-8B5C-172ECD86C674.root?tried=+1098cms-xrd-global011098cms-xrd-global02.cern.ch&xrdcl.requuid=ef982b41-a2af-4e31-b7f9-83a99841010f.
%MSG
%MSG-w XrdAdaptorInternal:  PreMixingModule:mixData 15-Jan-2021 11:33:19 CET  Run: 1 Event: 5604
Failed to open file at URL root://cms-xrd-global.cern.ch:1094//store/relval/CMSSW_10_6_0/RelValPREMIXUP17_PU25/PREMIX/PU25ns_106X_mc2017_realistic_v3-v1/10000/0847C39A-8322-D144-8B5C-172ECD86C674.root?tried=&xrdcl.requuid=c097d504-8ae0-4a8f-9d5c-41c1b6734bcf.
%MSG
----- Begin Fatal Exception 15-Jan-2021 11:33:19 CET-----------------------
An exception of category 'FileOpenError' occurred while
   [0] Processing  Event run: 1 lumi: 113 event: 5604 stream: 0
   [1] Running path 'HLTAnalyzerEndpath'
   [2] Prefetching for module L1TRawToDigi/'hltGtStage2Digis'
   [3] Prefetching for module RawDataCollectorByLabel/'rawDataCollector'
   [4] Prefetching for module SiStripDigiToRawModule/'SiStripDigiToRaw'
   [5] Calling method for module PreMixingModule/'mixData'
   [6] Calling RootInputFileSequence::initTheFile()
   [7] Calling StorageFactory::open()
   [8] Calling XrdFile::open()
Exception Message:
Failed to open the file 'root://cms-xrd-global.cern.ch//store/relval/CMSSW_10_6_0/RelValPREMIXUP17_PU25/PREMIX/PU25ns_106X_mc2017_realistic_v3-v1/10000/0847C39A-8322-D144-8B5C-172ECD86C674.root'
   Additional Info:
      [a] Calling RootInputFileSequence::initTheFile(): fail to open the file with name root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/relval/CMSSW_10_6_0/RelValPREMIXUP17_PU25/PREMIX/PU25ns_106X_mc2017_realistic_v3-v1/10000/0847C39A-8322-D144-8B5C-172ECD86C674.root
      [b] Input file root://cms-xrd-global.cern.ch//store/relval/CMSSW_10_6_0/RelValPREMIXUP17_PU25/PREMIX/PU25ns_106X_mc2017_realistic_v3-v1/10000/0847C39A-8322-D144-8B5C-172ECD86C674.root could not be opened.
      [c] XrdCl::File::Open(name='root://cms-xrd-global.cern.ch//store/relval/CMSSW_10_6_0/RelValPREMIXUP17_PU25/PREMIX/PU25ns_106X_mc2017_realistic_v3-v1/10000/0847C39A-8322-D144-8B5C-172ECD86C674.root', flags=0x10, permissions=0660) => error '[ERROR] Server responded with an error: [3011] No servers are available to read the file.' (errno=3011, code=400). No additional data servers were found.
      [d] Last URL tried: root://cms-xrd-global.cern.ch:1094//store/relval/CMSSW_10_6_0/RelValPREMIXUP17_PU25/PREMIX/PU25ns_106X_mc2017_realistic_v3-v1/10000/0847C39A-8322-D144-8B5C-172ECD86C674.root?tried=&xrdcl.requuid=c097d504-8ae0-4a8f-9d5c-41c1b6734bcf
      [e] Problematic data server: cms-xrd-global.cern.ch:1094
      [f] Disabled source: cms-xrd-global.cern.ch:1094
----- End Fatal Exception -------------------------------------------------
15-Jan-2021 11:33:19 CET  Closed file root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/relval/CMSSW_10_6_0/RelValTTbar_13/GEN-SIM/106X_mc2017_realistic_v3-v1/10000/3E58DF38-2AC3-B044-A9E2-D43678CA026B.root

@smuzaffar
Contributor

Looks like an EOS glitch; otherwise the file is available:

[cmsbuild@lxplus702 ~]$ ls -l /eos/cms/store/user/cmsbuild/store/relval/CMSSW_10_6_0/RelValPREMIXUP17_PU25/PREMIX/PU25ns_106X_mc2017_realistic_v3-v1/10000/0847C39A-8322-D144-8B5C-172ECD86C674.root
-rw-r--r--. 1 cmsbuild zh 3575267387 May 31  2019 /eos/cms/store/user/cmsbuild/store/relval/CMSSW_10_6_0/RelValPREMIXUP17_PU25/PREMIX/PU25ns_106X_mc2017_realistic_v3-v1/10000/0847C39A-8322-D144-8B5C-172ECD86C674.root

@jpata
Contributor

jpata commented Jan 15, 2021

Let's rerun the test after some time (e.g. tonight?) to get the final greenlight.

@jpata
Contributor

jpata commented Jan 18, 2021

@cmsbuild please test

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a29982/12313/summary.html
COMMIT: 92f3342
CMSSW: CMSSW_11_3_X_2021-01-17-2300/slc7_amd64_gcc900
Additional Tests: GPU

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 37
  • DQMHistoTests: Total histograms compared: 2716961
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2716938
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 36 files compared)
  • Checked 156 log files, 37 edm output root files, 37 DQM output files

@jpata
Contributor

jpata commented Jan 18, 2021

+reconstruction

  • ECAL local reconstruction on GPUs
  • follow-up comments are tracked in issue #32480, "Open issues regarding the ECAL local reconstruction on GPU"
  • we don't see issues in the jenkins tests, the code runs, and does not introduce problems for existing workflows (apart from reordering trigger bits due to python dict behaviour which is OK)
  • CPU to GPU comparison workflows show differences which have not been traced down, but we accept it as it is

@silviodonato
Contributor

+1

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will be automatically merged.
