[Tensorflow] Build with GPU enabled #7648

Merged (3 commits) on Feb 13, 2023

smuzaffar
Contributor

No description provided.

@cmsbuild
Contributor

A new Pull Request was created by @smuzaffar (Malik Shahzad Muzaffar) for branch IB/CMSSW_12_3_X/master.

@cmsbuild, @smuzaffar, @iarspider can you please review it and eventually sign? Thanks.
@perrotta, @dpiparo, @qliphy you are the release manager for this.
cms-bot commands are listed here

@smuzaffar
Contributor Author

test parameters:

  • enable_test = gpu,threading,profiling

@smuzaffar
Contributor Author

please test

@cmsbuild
Contributor

-1

Failed Tests: RelVals-GPU
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5fdcc1/22659/summary.html
COMMIT: 85243c5
CMSSW: CMSSW_12_3_X_2022-02-24-1100/slc7_amd64_gcc10
Additional Tests: GPU,THREADING,PROFILING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7648/22659/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5fdcc1/22659/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5fdcc1/22659/git-merge-result

RelVals-GPU

  • 11634.512 11634.512_TTbar_14TeV+2021_Patatrack_ECALOnlyGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano/step2_TTbar_14TeV+2021_Patatrack_ECALOnlyGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano.log
  • 11634.522 11634.522_TTbar_14TeV+2021_Patatrack_HCALOnlyGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano/step2_TTbar_14TeV+2021_Patatrack_HCALOnlyGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano.log
  • 11634.506 11634.506_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano/step2_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano.log

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 4001143
  • DQMHistoTests: Total failures: 2
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4001119
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 204 log files, 45 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@smuzaffar
Contributor Author

FYI @fwyzard @tvami, this PR enables building TensorFlow with GPU support.

@cmsbuild
Contributor

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@tvami

tvami commented Feb 25, 2022

assign heterogeneous

@cmsbuild
Contributor

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@fwyzard
Contributor

fwyzard commented Feb 28, 2022

  • 11634.506 11634.506_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano/step2_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano.log

This works for me (I did not try the other two):

fwyzard@gputest-milan-01.cms:/data/user/fwyzard/test-TensorFlow-on-GPU/CMSSW_12_3_X_2022-02-24-1100/src$ cudaComputeCapabilities 
   0     7.5    Tesla T4
   1     7.5    Tesla T4

fwyzard@gputest-milan-01.cms:/data/user/fwyzard/test-TensorFlow-on-GPU/CMSSW_12_3_X_2022-02-24-1100/src$ CUDA_VISIBLE_DEVICES=0 runTheMatrix.py -w upgrade -l 11634.506
ignoring non-requested file relval_standard
ignoring non-requested file relval_highstats
ignoring non-requested file relval_pileup
ignoring non-requested file relval_generator
ignoring non-requested file relval_extendedgen
ignoring non-requested file relval_production
ignoring non-requested file relval_ged
processing relval_upgrade
ignoring non-requested file relval_cleanedupgrade
ignoring non-requested file relval_gpu
ignoring non-requested file relval_2017
ignoring non-requested file relval_2026
ignoring non-requested file relval_identity
ignoring non-requested file relval_machine
ignoring non-requested file relval_premix
Running up to 4 concurrent jobs, each with 1 thread per process

Preparing to run 11634.506 TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano

# in: /data/user/fwyzard/test-TensorFlow-on-GPU/CMSSW_12_3_X_2022-02-24-1100/src going to execute cd 11634.506_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano
 cmsDriver.py TTbar_14TeV_TuneCP5_cfi  -s GEN,SIM -n 10 --conditions auto:phase1_2021_realistic --beamspot Run3RoundOptics25ns13TeVLowSigmaZ --datatier GEN-SIM --eventcontent FEVTDEBUG --geometry DB:Extended --era Run3 --relval 9000,100 --fileout file:step1.root  > step1_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano.log  2>&1
 

# in: /data/user/fwyzard/test-TensorFlow-on-GPU/CMSSW_12_3_X_2022-02-24-1100/src going to execute cd 11634.506_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano
 cmsDriver.py step2  -s DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@relval2021 --conditions auto:phase1_2021_realistic --datatier GEN-SIM-DIGI-RAW -n 10 --eventcontent FEVTDEBUGHLT --geometry DB:Extended --era Run3 --procModifiers gpu --customise HLTrigger/Configuration/customizeHLTforPatatrack.enablePatatrackPixelTriplets --filein  file:step1.root  --fileout file:step2.root  > step2_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano.log  2>&1
 

# in: /data/user/fwyzard/test-TensorFlow-on-GPU/CMSSW_12_3_X_2022-02-24-1100/src going to execute cd 11634.506_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano
 cmsDriver.py step3  -s RAW2DIGI:RawToDigi_pixelOnly,RECO:reconstruction_pixelTrackingOnly,VALIDATION:@pixelTrackingOnlyValidation,DQM:@pixelTrackingOnlyDQM --conditions auto:phase1_2021_realistic --datatier GEN-SIM-RECO,DQMIO -n 10 --eventcontent RECOSIM,DQM --geometry DB:Extended --era Run3 --procModifiers pixelNtupletFit,gpu --customise RecoPixelVertexing/Configuration/customizePixelTracksForTriplets.customizePixelTracksForTriplets --filein  file:step2.root  --fileout file:step3.root  > step3_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano.log  2>&1
 

# in: /data/user/fwyzard/test-TensorFlow-on-GPU/CMSSW_12_3_X_2022-02-24-1100/src going to execute cd 11634.506_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano
 cmsDriver.py step4  -s HARVESTING:@trackingOnlyValidation+@pixelTrackingOnlyDQM --conditions auto:phase1_2021_realistic --mc  --geometry DB:Extended --scenario pp --filetype DQM --era Run3 -n 100  --filein file:step3_inDQM.root --fileout file:step4.root  > step4_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano.log  2>&1
 
11634.506_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Mon Feb 28 14:18:29 2022-date Mon Feb 28 14:13:25 2022; exit: 0 0 0 0
1 1 1 1 tests passed, 0 0 0 0 failed

fwyzard@gputest-milan-01.cms:/data/user/fwyzard/test-TensorFlow-on-GPU/CMSSW_12_3_X_2022-02-24-1100/src$ cat 11634.506_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano/step2_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano.log
%MSG-i CUDAService:  (NoModuleName) 28-Feb-2022 14:16:27 CET pre-events
CUDA runtime version 11.4, driver version 11.6, NVIDIA driver version 510.39.01
CUDA device 0: Tesla T4 (sm_75)
%MSG
28-Feb-2022 14:16:32 CET  Initiating request to open file file:step1.root
28-Feb-2022 14:16:33 CET  Successfully opened file file:step1.root
2022-02-28 14:16:39.412607: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-02-28 14:16:39.785240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13600 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:41:00.0, compute capability: 7.5
2022-02-28 14:16:39.851510: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13600 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:41:00.0, compute capability: 7.5
2022-02-28 14:16:39.861919: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13600 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:41:00.0, compute capability: 7.5
2022-02-28 14:16:39.870629: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13600 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:41:00.0, compute capability: 7.5
2022-02-28 14:16:40.488995: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8202
PersistencyIO    INFO  +++ Set Streamer to dd4hep::OpaqueDataBlock
DD4hep           WARN  ++ Using globally Geant4 unit system (mm,ns,MeV)
DD4CMS           INFO  +++ Processing the CMS detector description xml-memory-buffer
Detector         INFO  *********** Created World volume with size: 101000 101000 450000
Detector         INFO  +++ Patching names of anonymous shapes....
DDDefinition     INFO  +++ Finished processing xml-memory-buffer
Begin processing the 1st record. Run 1, Event 1, LumiSection 1 on stream 0 at 28-Feb-2022 14:16:55.496 CET
2022-02-28 14:17:00.334038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13600 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:41:00.0, compute capability: 7.5
#--------------------------------------------------------------------------
#                         FastJet release 3.4.0
#                 M. Cacciari, G.P. Salam and G. Soyez                  
#     A software package for jet finding and analysis at colliders      
#                           http://fastjet.fr                           
#                                                                             
# Please cite EPJC72(2012)1896 [arXiv:1111.6097] if you use this package
# for scientific work and optionally PLB641(2006)57 [hep-ph/0512210].   
#                                                                       
# FastJet is provided without warranty under the GNU GPL v2 or higher.  
# It uses T. Chan's closest pair algorithm, S. Fortune's Voronoi code
# and 3rd party plugin jet algorithms. See COPYING file for details.
#--------------------------------------------------------------------------
%MSG-e TkDetLayers:   TrackerRecoGeometryESProducer:TrackerRecoGeometryESProducer@callESModule  28-Feb-2022 14:17:00 CET Run: 1 Event: 1
 ForwardDiskSectorBuilderFromDet: Trying to build Petal Wedge from Dets at different z positions !! Delta_z = -0.951241
%MSG
Begin processing the 2nd record. Run 1, Event 2, LumiSection 1 on stream 0 at 28-Feb-2022 14:17:03.295 CET
Begin processing the 3rd record. Run 1, Event 3, LumiSection 1 on stream 0 at 28-Feb-2022 14:17:05.265 CET
Begin processing the 4th record. Run 1, Event 4, LumiSection 1 on stream 0 at 28-Feb-2022 14:17:07.719 CET
Begin processing the 5th record. Run 1, Event 5, LumiSection 1 on stream 0 at 28-Feb-2022 14:17:09.384 CET
Begin processing the 6th record. Run 1, Event 6, LumiSection 1 on stream 0 at 28-Feb-2022 14:17:11.338 CET
Begin processing the 7th record. Run 1, Event 7, LumiSection 1 on stream 0 at 28-Feb-2022 14:17:12.831 CET
Begin processing the 8th record. Run 1, Event 8, LumiSection 1 on stream 0 at 28-Feb-2022 14:17:14.200 CET
Begin processing the 9th record. Run 1, Event 9, LumiSection 1 on stream 0 at 28-Feb-2022 14:17:15.457 CET
Begin processing the 10th record. Run 1, Event 10, LumiSection 1 on stream 0 at 28-Feb-2022 14:17:17.310 CET
28-Feb-2022 14:17:18 CET  Closed file file:step1.root

=============================================

MessageLogger Summary

 type     category        sev    module        subroutine        count    total
 ---- -------------------- -- ---------------- ----------------  -----    -----
    1 TkDetLayers          -e TrackerRecoGeome                       1        1
    2 fileAction           -s file_close                             1        1
    3 fileAction           -s file_open                              2        2

 type    category    Examples: run/evt        run/evt          run/evt
 ---- -------------------- ---------------- ---------------- ----------------
    1 TkDetLayers          1/1                               
    2 fileAction           End Run: 1                        
    3 fileAction           pre-events       pre-events       

Severity    # Occurrences   Total Occurrences
--------    -------------   -----------------
Error                   1                   1
System                  3                   3

dropped waiting message count 0
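
The step2 log above shows TensorFlow creating a GPU device several times, each time reporting 13600 MB on the Tesla T4. As a minimal sketch (not part of the PR), the following Python pulls those "Created device" lines out of a log so the device name, reported memory, and compute capability can be checked at a glance. The function name `tf_gpu_devices` and the regular expression are assumptions written for this log format only; other TensorFlow versions may print the line differently.

```python
import re

# Pattern for TensorFlow's "Created device" log line, as printed in the step2 log above.
# Assumption: the line layout matches this log; it is not a documented, stable format.
CREATED_DEVICE = re.compile(
    r"Created device (?P<device>\S+) with (?P<memory_mb>\d+) MB memory:\s+"
    r"-> device: \d+, name: (?P<name>[^,]+), pci bus id: [^,]+, "
    r"compute capability: (?P<capability>[\d.]+)"
)


def tf_gpu_devices(log_text):
    """Return one dict per 'Created device' line found in log_text (hypothetical helper)."""
    devices = []
    for match in CREATED_DEVICE.finditer(log_text):
        info = match.groupdict()
        info["memory_mb"] = int(info["memory_mb"])
        devices.append(info)
    return devices


# Two lines copied from the log above, used here as sample input.
SAMPLE_LOG = """\
2022-02-28 14:16:39.785240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13600 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:41:00.0, compute capability: 7.5
2022-02-28 14:17:00.334038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13600 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:41:00.0, compute capability: 7.5
"""

if __name__ == "__main__":
    for dev in tf_gpu_devices(SAMPLE_LOG):
        print(dev["device"], dev["name"], dev["memory_mb"], "MB", dev["capability"])
```

Run against a full step log, the count of returned entries shows how many times a GPU device was created in that job.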

@smuzaffar
Contributor Author

please test for slc7_ppc64le_gcc11

@cmsbuild
Contributor

cmsbuild commented Mar 2, 2022

-1

Failed Tests: UnitTests RelVals RelVals-THREADING AddOn
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5fdcc1/22764/summary.html
COMMIT: 85243c5
CMSSW: CMSSW_12_3_X_2022-03-01-2300/slc7_ppc64le_gcc11
Additional Tests: GPU,THREADING,PROFILING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/7648/22764/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5fdcc1/22764/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5fdcc1/22764/git-merge-result

Unit Tests

I found errors in the following unit tests:

---> test testTFGraphLoading had ERRORS
---> test testTFMetaGraphLoading had ERRORS
---> test testTFThreadPools had ERRORS
---> test DRNTest had ERRORS
and more ...

RelVals

----- Begin Fatal Exception 02-Mar-2022 14:52:40 CET-----------------------
An exception of category 'InvalidRun' occurred while
   [0] Processing  Event run: 301998 lumi: 9 event: 9312468 stream: 0
   [1] Running path 'HLTAnalyzerEndpath'
   [2] Prefetching for module L1TRawToDigi/'hltGtStage2Digis'
   [3] Prefetching for module RawDataCollectorByLabel/'rawDataCollector'
   [4] Prefetching for module L1TDigiToRaw/'packGmtStage2'
   [5] Calling method for module L1TMuonEndCapTrackProducer/'simEmtfDigis'
Exception Message:
error while running session: Internal: 2 root error(s) found.
  (0) Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
	 [[{{node dense_1/MatMul}}]]
	 [[dense_4/BiasAdd/_9]]
  (1) Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
	 [[{{node dense_1/MatMul}}]]
0 successful operations.
0 derived errors ignored.
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 02-Mar-2022 14:52:41 CET-----------------------
An exception of category 'InvalidRun' occurred while
   [0] Processing  Event run: 274199 lumi: 21 event: 39389360 stream: 0
   [1] Running path 'HLTAnalyzerEndpath'
   [2] Prefetching for module L1TRawToDigi/'hltGtStage2Digis'
   [3] Prefetching for module RawDataCollectorByLabel/'rawDataCollector'
   [4] Prefetching for module L1TDigiToRaw/'packGmtStage2'
   [5] Calling method for module L1TMuonEndCapTrackProducer/'simEmtfDigis'
Exception Message:
error while running session: Internal: 2 root error(s) found.
  (0) Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
	 [[{{node dense_1/MatMul}}]]
	 [[dense_4/BiasAdd/_9]]
  (1) Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
	 [[{{node dense_1/MatMul}}]]
0 successful operations.
0 derived errors ignored.
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 02-Mar-2022 14:52:40 CET-----------------------
An exception of category 'InvalidRun' occurred while
   [0] Processing  Event run: 319450 lumi: 76 event: 105867030 stream: 0
   [1] Running path 'HLTAnalyzerEndpath'
   [2] Prefetching for module L1TRawToDigi/'hltGtStage2Digis'
   [3] Prefetching for module RawDataCollectorByLabel/'rawDataCollector'
   [4] Prefetching for module L1TDigiToRaw/'packGmtStage2'
   [5] Calling method for module L1TMuonEndCapTrackProducer/'simEmtfDigis'
Exception Message:
error while running session: Internal: 2 root error(s) found.
  (0) Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
	 [[{{node dense_1/MatMul}}]]
	 [[dense_4/BiasAdd/_9]]
  (1) Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
	 [[{{node dense_1/MatMul}}]]
0 successful operations.
0 derived errors ignored.
----- End Fatal Exception -------------------------------------------------
Expand to see more relval errors ...

RelVals-THREADING

  • 136.731 136.731_RunSinglePh2016B+RunSinglePh2016B+HLTDR2_2016+RECODR2_2016reHLT_skimSinglePh_HIPM+HARVESTDR2_skimSinglePh/step3_RunSinglePh2016B+RunSinglePh2016B+HLTDR2_2016+RECODR2_2016reHLT_skimSinglePh_HIPM+HARVESTDR2_skimSinglePh.log
  • 7.3 7.3_CosmicsSPLoose_UP18+CosmicsSPLoose_UP18+DIGICOS_UP18+RECOCOS_UP18+ALCACOS_UP18+HARVESTCOS_UP18/step2_CosmicsSPLoose_UP18+CosmicsSPLoose_UP18+DIGICOS_UP18+RECOCOS_UP18+ALCACOS_UP18+HARVESTCOS_UP18.log
  • 136.793 136.793_RunDoubleEG2017C+RunDoubleEG2017C+HLTDR2_2017+RECODR2_2017reHLT_skimDoubleEG_Prompt+HARVEST2017_skimDoubleEG/step2_RunDoubleEG2017C+RunDoubleEG2017C+HLTDR2_2017+RECODR2_2017reHLT_skimDoubleEG_Prompt+HARVEST2017_skimDoubleEG.log
Expand to see more relval errors ...

AddOn Tests

----- Begin Fatal Exception 02-Mar-2022 15:37:15 CET-----------------------
An exception of category 'InvalidRun' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 2 stream: 1
   [1] Running path 'L1TAnalyzerEndpath'
   [2] Prefetching for module L1TGlobalSummary/'L1TGlobalSummary'
   [3] Prefetching for module L1TGlobalProducer/'simGtStage2Digis'
   [4] Prefetching for module L1TMuonProducer/'simGmtStage2Digis'
   [5] Calling method for module L1TMuonEndCapTrackProducer/'simEmtfDigis'
Exception Message:
error while running session: Internal: 2 root error(s) found.
  (0) Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
	 [[{{node dense_1/MatMul}}]]
	 [[dense_4/BiasAdd/_9]]
  (1) Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
	 [[{{node dense_1/MatMul}}]]
0 successful operations.
0 derived errors ignored.
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 02-Mar-2022 15:39:55 CET-----------------------
An exception of category 'InvalidRun' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 2 stream: 2
   [1] Running path 'L1TAnalyzerEndpath'
   [2] Prefetching for module L1TGlobalSummary/'L1TGlobalSummary'
   [3] Prefetching for module L1TGlobalProducer/'simGtStage2Digis'
   [4] Prefetching for module L1TMuonProducer/'simGmtStage2Digis'
   [5] Calling method for module L1TMuonEndCapTrackProducer/'simEmtfDigis'
Exception Message:
error while running session: Internal: 2 root error(s) found.
  (0) Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
	 [[{{node dense_1/MatMul}}]]
	 [[dense_4/BiasAdd/_9]]
  (1) Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
	 [[{{node dense_1/MatMul}}]]
0 successful operations.
0 derived errors ignored.
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 02-Mar-2022 15:42:58 CET-----------------------
An exception of category 'InvalidRun' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 1
   [1] Running path 'HLT_DoubleMediumDeepTauIsoPFTauHPS35_L2NN_eta2p1_v1'
   [2] Calling method for module L2TauNNProducer/'hltL2TauTagNNProducer'
Exception Message:
error while running session: Internal: 2 root error(s) found.
  (0) Internal: Blas xGEMV launch failed : a.shape=[1,3,20], b.shape=[1,20,1], m=3, n=1, k=20
	 [[{{node cnn_model/StatefulPartitionedCall/StatefulPartitionedCall/output/MatMul}}]]
	 [[Identity/_7]]
  (1) Internal: Blas xGEMV launch failed : a.shape=[1,3,20], b.shape=[1,20,1], m=3, n=1, k=20
	 [[{{node cnn_model/StatefulPartitionedCall/StatefulPartitionedCall/output/MatMul}}]]
0 successful operations.
0 derived errors ignored.
----- End Fatal Exception -------------------------------------------------
Expand to see more addon errors ...

@smuzaffar changed the base branch from IB/CMSSW_12_3_X/master to IB/CMSSW_12_4_X/master March 11, 2022 08:15
@tvami

tvami commented Apr 22, 2022

please test for slc7_ppc64le_gcc11

@tvami

tvami commented Apr 25, 2022

Have the tests been stuck for 3 days now?

@tvami

tvami commented Apr 25, 2022

@cmsbuild , please abort

@tvami

tvami commented Apr 25, 2022

please test for slc7_ppc64le_gcc11

@cmsbuild
Contributor

cmsbuild commented Feb 7, 2023

-1

Failed Tests: UnitTests RelVals AddOn
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5fdcc1/30468/summary.html
COMMIT: 86ea719
CMSSW: CMSSW_13_0_X_2023-02-06-2300/el8_ppc64le_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7648/30468/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5fdcc1/30468/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5fdcc1/30468/git-merge-result

Unit Tests

I found errors in the following unit tests:

---> test TestDQMOnlineClient-dt_dqm_sourceclient had ERRORS
---> test TestDQMOnlineClient-visualization_secondInstance had ERRORS
---> test TestDQMOnlineClient-visualization had ERRORS
---> test TestDQMOnlineClient-dt4ml_dqm_sourceclient had ERRORS
and more ...

RelVals

----- Begin Fatal Exception 07-Feb-2023 20:32:23 CET-----------------------
An exception of category 'InvalidRun' occurred while
   [0] Processing  Event run: 301998 lumi: 9 event: 9312468 stream: 0
   [1] Running path 'HLTAnalyzerEndpath'
   [2] Prefetching for module L1TRawToDigi/'hltGtStage2Digis'
   [3] Prefetching for module RawDataCollectorByLabel/'rawDataCollector'
   [4] Prefetching for module L1TDigiToRaw/'packGmtStage2'
   [5] Calling method for module L1TMuonEndCapTrackProducer/'simEmtfDigis'
Exception Message:
error while running session: Internal: 2 root error(s) found.
  (0) Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
	 [[{{node sequential_3/dense_7/MatMul}}]]
	 [[Identity/_3]]
  (1) Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
	 [[{{node sequential_3/dense_7/MatMul}}]]
0 successful operations.
0 derived errors ignored.
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 07-Feb-2023 20:32:35 CET-----------------------
An exception of category 'InvalidRun' occurred while
   [0] Processing  Event run: 319450 lumi: 76 event: 105867030 stream: 0
   [1] Running path 'HLTAnalyzerEndpath'
   [2] Prefetching for module L1TRawToDigi/'hltGtStage2Digis'
   [3] Prefetching for module RawDataCollectorByLabel/'rawDataCollector'
   [4] Prefetching for module L1TDigiToRaw/'packGmtStage2'
   [5] Calling method for module L1TMuonEndCapTrackProducer/'simEmtfDigis'
Exception Message:
error while running session: Resource exhausted: OOM when allocating tensor of shape [23,20] and type float
	 [[{{node sequential_3/dense_7/MatMul/ReadVariableOp/resource}}]]
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 07-Feb-2023 20:34:09 CET-----------------------
An exception of category 'InvalidSession' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing module: class=DeepTauId label='deepTau2017v2p1ForMini'
Exception Message:
error while creating session: Internal: CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory
----- End Fatal Exception -------------------------------------------------
Expand to see more relval errors ...

AddOn Tests

----- Begin Fatal Exception 07-Feb-2023 20:36:44 CET-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 3 stream: 0
   [1] Running path 'DQM_PixelReconstruction_v4'
   [2] Calling method for module CAHitNtupletCUDAPhase1/'hltPixelTracksGPU'
Exception Message:
A std::exception was thrown.

/scratch/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/CMSSW_13_0_X_2023-02-06-2300/src/RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cu, line 63:
cudaCheck(cudaGetLastError());
cudaErrorMemoryAllocation: out of memory
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 07-Feb-2023 20:39:52 CET-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 4 stream: 0
   [1] Running path 'DQM_PixelReconstruction_v4'
   [2] Calling method for module CAHitNtupletCUDAPhase1/'hltPixelTracksGPU'
Exception Message:
A std::exception was thrown.

/scratch/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/CMSSW_13_0_X_2023-02-06-2300/src/RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cu, line 63:
cudaCheck(cudaGetLastError());
cudaErrorMemoryAllocation: out of memory
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 07-Feb-2023 20:36:14 CET-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
   [1] Running path 'DQM_HIPixelReconstruction_v3'
   [2] Calling method for module CAHitNtupletCUDAPhase1/'hltPixelTracksGPU'
Exception Message:
A std::exception was thrown.

/scratch/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/CMSSW_13_0_X_2023-02-06-2300/src/RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cu, line 63:
cudaCheck(cudaGetLastError());
cudaErrorMemoryAllocation: out of memory
----- End Fatal Exception -------------------------------------------------
Expand to see more addon errors ...

@cmsbuild
Contributor

cmsbuild commented Feb 7, 2023

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5fdcc1/30467/summary.html
COMMIT: 86ea719
CMSSW: CMSSW_13_0_X_2023-02-07-1100/el8_amd64_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7648/30467/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5fdcc1/30467/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5fdcc1/30467/git-merge-result

Comparison Summary

Summary:

  • You potentially added 92 lines to the logs
  • Reco comparison results: 2920 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3555852
  • DQMHistoTests: Total failures: 572
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3555258
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 213 log files, 164 edm output root files, 49 DQM output files
  • TriggerResults: found differences in 1 / 47 workflows

GPU Comparison Summary

Summary:

  • You potentially removed 135 lines from the logs
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 5
  • DQMHistoTests: Total histograms compared: 35796
  • DQMHistoTests: Total failures: 138
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 35658
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 4 files compared)
  • Checked 16 log files, 12 edm output root files, 5 DQM output files
  • TriggerResults: found differences in 4 / 4 workflows
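
The DQMHistoTests counters in these summaries can be sanity-checked with a little arithmetic. In every summary in this thread, successes + failures + nulls + skipped equals the number of histograms compared; that is an observation from these numbers, not a documented guarantee of the comparison tool. The sketch below parses the bullets and reports the failure fraction. The helper names `parse_dqm_summary` and `check_summary` are hypothetical.

```python
import re

# Matches bullets such as "DQMHistoTests: Total failures: 572".
COUNTER = re.compile(r"DQMHistoTests: Total ([A-Za-z ]+): (\d+)")


def parse_dqm_summary(text):
    """Map each 'DQMHistoTests: Total <name>' counter to its integer value."""
    return {name.strip().lower(): int(value) for name, value in COUNTER.findall(text)}


def check_summary(counters):
    """Return (is_consistent, failure_fraction) for one comparison summary.

    Assumption: successes + failures + nulls + skipped should add up to
    'histograms compared', as it does for the summaries in this thread.
    """
    accounted = (counters["successes"] + counters["failures"]
                 + counters["nulls"] + counters["skipped"])
    total = counters["histograms compared"]
    return accounted == total, counters["failures"] / total


# Counters copied from the el8_amd64_gcc11 report of Feb 7, 2023 above.
STANDARD = """\
  • DQMHistoTests: Total histograms compared: 3555852
  • DQMHistoTests: Total failures: 572
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3555258
  • DQMHistoTests: Total skipped: 22
"""
GPU = """\
  • DQMHistoTests: Total histograms compared: 35796
  • DQMHistoTests: Total failures: 138
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 35658
  • DQMHistoTests: Total skipped: 0
"""

if __name__ == "__main__":
    for label, text in (("standard", STANDARD), ("GPU", GPU)):
        consistent, fraction = check_summary(parse_dqm_summary(text))
        print(f"{label}: consistent={consistent}, failure fraction={fraction:.2e}")
```

Comparing the failure fraction between two test runs gives a quick feel for whether a change in the failure count is large relative to the number of histograms compared.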

@valsdav
Contributor

valsdav commented Feb 8, 2023

I'm trying to understand why I still see CUDA out-of-memory errors from TensorFlow in the el8_ppc64le_gcc11 tests. There may be another use of TensorFlow in cmssw that I didn't find and where the backend is not set up correctly. Investigating.

The tests for el8_amd64_gcc11 and el8_aarch64_gcc11, on the other hand, are fine, but I see some differences. Are those coming from other PRs?
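
One way to narrow down which modules still hit the GPU is to tally the "Fatal Exception" blocks in the test logs by innermost stack entry and error kind. The Python below is a rough sketch under that assumption: the function `summarize` and its three error labels are hypothetical, and the sample is a trimmed copy of the three RelVals exceptions from the el8_ppc64le_gcc11 report above.

```python
import re
from collections import Counter

# One match per "Begin Fatal Exception ... End Fatal Exception" block.
BLOCK = re.compile(
    r"Begin Fatal Exception[^\n]*\n(.*?)-+ End Fatal Exception", re.DOTALL
)
# Stack entries look like "   [5] Calling method for module ...".
FRAME = re.compile(r"^\s*\[\d+\]\s+(.*)$", re.MULTILINE)


def summarize(log_text):
    """Count exceptions per (innermost stack entry, error kind)."""
    counts = Counter()
    for body in BLOCK.findall(log_text):
        context, _, message = body.partition("Exception Message:")
        frames = FRAME.findall(context)
        innermost = frames[-1].strip() if frames else "(unknown)"
        lowered = message.lower()
        # Labels are ad hoc, chosen for the messages seen in this thread.
        if "out of memory" in lowered or "oom when allocating" in lowered:
            kind = "out of memory"
        elif "without blas support" in lowered:
            kind = "no BLAS support"
        else:
            kind = "other"
        counts[(innermost, kind)] += 1
    return counts


# Trimmed copies of three exception blocks from the report above.
SAMPLE = """\
----- Begin Fatal Exception 07-Feb-2023 20:32:23 CET-----------------------
An exception of category 'InvalidRun' occurred while
   [0] Processing  Event run: 301998 lumi: 9 event: 9312468 stream: 0
   [5] Calling method for module L1TMuonEndCapTrackProducer/'simEmtfDigis'
Exception Message:
error while running session: Internal: 2 root error(s) found.
  (0) Internal: Attempting to perform BLAS operation using StreamExecutor without BLAS support
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 07-Feb-2023 20:32:35 CET-----------------------
An exception of category 'InvalidRun' occurred while
   [0] Processing  Event run: 319450 lumi: 76 event: 105867030 stream: 0
   [5] Calling method for module L1TMuonEndCapTrackProducer/'simEmtfDigis'
Exception Message:
error while running session: Resource exhausted: OOM when allocating tensor of shape [23,20] and type float
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 07-Feb-2023 20:34:09 CET-----------------------
An exception of category 'InvalidSession' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing module: class=DeepTauId label='deepTau2017v2p1ForMini'
Exception Message:
error while creating session: Internal: CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory
----- End Fatal Exception -------------------------------------------------
"""

if __name__ == "__main__":
    for (entry, kind), n in summarize(SAMPLE).items():
        print(f"{n}  {kind:16s} {entry}")
```

On the sample, this groups the failures under `L1TMuonEndCapTrackProducer` and `DeepTauId`, the two TensorFlow-related modules named in those exceptions.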

@smuzaffar
Contributor Author

smuzaffar commented Feb 8, 2023

Hmm, it looks like the ppc64le tests were not run with cms-sw/cmssw#40551. I have updated #7648 (comment) so that the PR tests always use the cmssw PR.

@smuzaffar
Contributor Author

please test with cms-sw/cmssw#40551 for el8_aarch64_gcc11

@smuzaffar
Contributor Author

please test with cms-sw/cmssw#40551 for el8_ppc64le_gcc11

@smuzaffar
Contributor Author

please test with cms-sw/cmssw#40551

@cmsbuild
Contributor

cmsbuild commented Feb 8, 2023

@cmsbuild
Contributor

cmsbuild commented Feb 8, 2023

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5fdcc1/30515/summary.html
COMMIT: 86ea719
CMSSW: CMSSW_13_0_X_2023-02-08-1100/el8_amd64_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7648/30515/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5fdcc1/30515/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5fdcc1/30515/git-merge-result

Comparison Summary

Summary:

  • You potentially added 114 lines to the logs
  • Reco comparison results: 27 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3555852
  • DQMHistoTests: Total failures: 37
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3555793
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 213 log files, 164 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 5
  • DQMHistoTests: Total histograms compared: 35796
  • DQMHistoTests: Total failures: 120
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 35676
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 4 files compared)
  • Checked 16 log files, 12 edm output root files, 5 DQM output files
  • TriggerResults: found differences in 3 / 4 workflows
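As a quick sanity check on the two summaries above, the DQMHistoTests counts can be verified to add up. This is only a sketch: it assumes that failures, nulls, successes and skipped are disjoint categories that should sum to the number of histograms compared, which the report does not state explicitly. The dictionary keys and the helper names `is_consistent` and `failure_rate` are hypothetical and not part of any cms-bot tooling.

```python
# Counts copied from the two comparison summaries above.
# The labels "standard" and "gpu" are illustrative names, not cms-bot identifiers.
summaries = {
    "standard": {"compared": 3555852, "failures": 37, "nulls": 0,
                 "successes": 3555793, "skipped": 22},
    "gpu": {"compared": 35796, "failures": 120, "nulls": 0,
            "successes": 35676, "skipped": 0},
}


def is_consistent(summary):
    """Return True if the per-category counts sum to the total compared.

    Assumption: the four categories are disjoint and exhaustive.
    """
    parts = ("failures", "nulls", "successes", "skipped")
    return sum(summary[key] for key in parts) == summary["compared"]


def failure_rate(summary):
    """Fraction of compared histograms that were reported as failures."""
    return summary["failures"] / summary["compared"]


for name, summary in summaries.items():
    print(f"{name}: consistent={is_consistent(summary)}, "
          f"failure rate={failure_rate(summary):.4%}")
```

Under that assumption both summaries are internally consistent, and the GPU comparison has a noticeably higher failure fraction than the standard one, even though its absolute number of histograms is much smaller.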

@valsdav
Contributor

valsdav commented Feb 9, 2023

please test with cms-sw/cmssw#40551 for el8_ppc64le_gcc11

@cmsbuild
Contributor

cmsbuild commented Feb 9, 2023

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5fdcc1/30534/summary.html
COMMIT: 86ea719
CMSSW: CMSSW_13_0_X_2023-02-08-2300/el8_ppc64le_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7648/30534/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5fdcc1/30534/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5fdcc1/30534/git-merge-result

Unit Tests

I found errors in the following unit tests:

---> test testTFHelloWorldCUDA had ERRORS

@valsdav
Contributor

valsdav commented Feb 9, 2023

I don't understand how such a simple unit test can saturate the CUDA memory. Is there maybe something wrong with the machine?

CUDA service enabled: 1
Testing CUDA backend
2023-02-09 10:14:42.793035: I tensorflow/cc/saved_model/reader.cc:38] Reading SavedModel from: /scratch/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/CMSSW_13_0_X_2023-02-08-2300/test/el8_ppc64le_gcc11/8d17-e9ff-9aa2-f77c/simplegraph
2023-02-09 10:14:42.794513: I tensorflow/cc/saved_model/reader.cc:90] Reading meta graph with tags { serve }
2023-02-09 10:14:42.794617: I tensorflow/cc/saved_model/reader.cc:132] Reading SavedModel debug info (if present) from: /scratch/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/CMSSW_13_0_X_2023-02-08-2300/test/el8_ppc64le_gcc11/8d17-e9ff-9aa2-f77c/simplegraph
2023-02-09 10:14:42.945401: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13239 MB memory:  -> device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0002:01:00.0, compute capability: 6.0
2023-02-09 10:14:42.994297: I tensorflow/cc/saved_model/loader.cc:229] Restoring SavedModel bundle.
2023-02-09 10:14:43.032018: I tensorflow/cc/saved_model/loader.cc:301] SavedModel load for tags { serve }; Status: success: OK. Took 239050 microseconds.
Tensor<type: float shape: [1,1] values: [46]>
terminate called after throwing an instance of 'std::runtime_error'
  what():  
/scratch/cmsbuild/jenkins_a/workspace/build-any-ib/w/tmp/BUILDROOT/43eacf8b0441396e668a404e7b12a378/opt/cmssw/el8_ppc64le_gcc11/cms/cmssw/CMSSW_13_0_X_2023-02-08-2300/src/HeterogeneousCore/CUDAServices/src/CUDAService.cc, line 404:
cudaCheck(cudaDeviceSynchronize());
cudaErrorMemoryAllocation: out of memory

timeout: the monitored command dumped core
/bin/sh: line 1: 57271 Aborted                 timeout 3600 sh -c 'testTFHelloWorldCUDA '
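One detail in the log above may be relevant: TensorFlow reports creating the GPU device "with 13239 MB memory" on a 16 GB card. By default TensorFlow maps most of the free GPU memory for its own allocator, so little may be left for other CUDA users in the same process. Whether that is what triggered the `cudaErrorMemoryAllocation` here is only a hypothesis. The sketch below just extracts the two numbers from the log line; the helper names `reserved_memory_mb` and `nominal_memory_mb` are hypothetical, and the nominal size is inferred from the "16GB" suffix in the device name rather than from any measurement.

```python
import re

# Log line copied from the failing testTFHelloWorldCUDA output above.
LOG_LINE = (
    "2023-02-09 10:14:42.945401: I tensorflow/core/common_runtime/gpu/"
    "gpu_device.cc:1510] Created device "
    "/job:localhost/replica:0/task:0/device:GPU:0 with 13239 MB memory:  "
    "-> device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0002:01:00.0, "
    "compute capability: 6.0"
)


def reserved_memory_mb(line):
    """Memory (MB) that TensorFlow says it created the device with.

    Hypothetical helper, not part of TensorFlow or CMSSW.
    """
    match = re.search(r"Created device \S+ with (\d+) MB memory", line)
    return int(match.group(1)) if match else None


def nominal_memory_mb(line):
    """Nominal card size in MB, guessed from an 'NNGB' suffix in the name.

    Assumption: the device name ends in its nominal capacity, e.g. '16GB'.
    """
    match = re.search(r"name: [^,]*?(\d+)GB", line)
    return int(match.group(1)) * 1024 if match else None


reserved = reserved_memory_mb(LOG_LINE)
nominal = nominal_memory_mb(LOG_LINE)
fraction = reserved / nominal
print(f"TensorFlow reserved {reserved} MB of a nominal {nominal} MB "
      f"({fraction:.0%})")
```

If this is indeed the cause, it would fit the earlier remark that the backend may not be set up correctly somewhere, but the log alone does not prove it.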

@valsdav
Contributor

valsdav commented Feb 10, 2023

please test with cms-sw/cmssw#40551 for el8_ppc64le_gcc11

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5fdcc1/30570/summary.html
COMMIT: 86ea719
CMSSW: CMSSW_13_0_X_2023-02-09-2300/el8_ppc64le_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7648/30570/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5fdcc1/30570/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-5fdcc1/30570/git-merge-result

@smuzaffar smuzaffar changed the base branch from IB/CMSSW_13_0_X/master to IB/CMSSW_13_1_X/master February 11, 2023 11:51
@valsdav
Contributor

valsdav commented Feb 13, 2023

I am planning to improve the organization of the TF options we expose to users, but I think that can be done in a separate PR.

The changes in this PR allow all the tests to pass (with backend::cpu by default). We will provide a runTheMatrix test with a GPU-activated workflow.

Do you think we can merge this one? Thanks!

@perrotta
Contributor

+1

@perrotta
Contributor

merge

@cmsbuild cmsbuild merged commit 005bae3 into IB/CMSSW_13_1_X/master Feb 13, 2023
@smuzaffar smuzaffar deleted the smuzaffar-patch-1 branch February 15, 2023 14:12