Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[backport] Improve handling of TF CUDA tests for 14_0_X #44375

Merged
merged 2 commits into from
Jul 2, 2024

Conversation

valsdav
Copy link
Contributor

@valsdav valsdav commented Mar 12, 2024

PR description:

This PR improves the handling of CUDA unit tests for the TensorFlow package, using a new tf_cuda_support tool from scram , which checks if the GPU support is enabled in TensorFlow compilation.

The PR also makes the TF cuda tests more strict by checking explicitely if a CUDA device is visible to TF and not only to cmssw.

The test testTFVisibleDevicesCUDA is in fact run by the framework as a CUDA device is registered, but then TF does not recognize the device and the test fails. The other testTF*CUDA tests were passing silently using the CPU to run the test. After this PR all the TF sessions using tf::backend::cuda , but not finding a GPU will fail explicitly.

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @valsdav for CMSSW_14_0_X.

It involves the following packages:

  • PhysicsTools/TensorFlow (ml)

@valsdav, @wpmccormack, @cmsbuild can you please review it and eventually sign? Thanks.
@makortel, @riga this is something you requested to watch as well.
@sextonkennedy, @rappoccio, @antoniovilela you are the release manager for this.

cms-bot commands are listed here

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 12, 2024

cms-bot internal usage

@antoniovilela
Copy link
Contributor

hold

  • As discussed at ORP.

@cmsbuild
Copy link
Contributor

Pull request has been put on hold by @antoniovilela
They need to issue an unhold command to remove the hold state or L1 can unhold it for all

@cmsbuild cmsbuild added the hold label Mar 12, 2024
@valsdav valsdav changed the title Disabling TensorFlow CUDA tests for 14_0_X [backport] Improve handling of TF CUDA tests for 14_0_X Mar 18, 2024
@valsdav valsdav force-pushed the tensorflow-disable-gpu-tests branch from a8063f6 to fcc3e4f Compare March 18, 2024 20:36
@cmsbuild
Copy link
Contributor

Pull request #44375 was updated. @valsdav, @cmsbuild, @wpmccormack can you please check and sign again.

@valsdav
Copy link
Contributor Author

valsdav commented Mar 18, 2024

enable gpu

@valsdav
Copy link
Contributor Author

valsdav commented Mar 18, 2024

please test

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b349d4/38238/summary.html
COMMIT: fcc3e4f
CMSSW: CMSSW_14_0_X_2024-03-18-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/44375/38238/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 27 lines to the logs
  • Reco comparison results: 39 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3346212
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3346190
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 205 log files, 166 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 9 differences found in the comparisons
  • DQMHistoTests: Total files compared: 3
  • DQMHistoTests: Total histograms compared: 39740
  • DQMHistoTests: Total failures: 183
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 39557
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
  • Checked 8 log files, 10 edm output root files, 3 DQM output files
  • TriggerResults: no differences found

@valsdav
Copy link
Contributor Author

valsdav commented Mar 22, 2024

+ml

technical. Avoid running cuda tests if tensorflow is not compiled for Cuda.
Making sure cuda tests fail if TF does not recognize the card.

@fwyzard
Copy link
Contributor

fwyzard commented Jun 20, 2024

hold

* As discussed at ORP.

@antoniovilela I don't remember any more why this was put on hold.

Do you think we can proceed with it ?
It seems to be blocking #45143.

@cmsbuild
Copy link
Contributor

REMINDER @antoniovilela, @rappoccio, @sextonkennedy: This PR was tested with #45143, please check if they should be merged together

@fwyzard
Copy link
Contributor

fwyzard commented Jun 25, 2024

enable gpu

@fwyzard
Copy link
Contributor

fwyzard commented Jun 25, 2024

please test

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b349d4/40076/summary.html
COMMIT: fcc3e4f
CMSSW: CMSSW_14_0_X_2024-06-25-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/44375/40076/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 48 differences found in the comparisons
  • DQMHistoTests: Total files compared: 3
  • DQMHistoTests: Total histograms compared: 39744
  • DQMHistoTests: Total failures: 1821
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 37923
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
  • Checked 8 log files, 10 edm output root files, 3 DQM output files
  • TriggerResults: no differences found

@missirol
Copy link
Contributor

Do you think we can proceed with it ? It seems to be blocking #45143.

Just to mention that #45143 was merged a few days ago, but this backport is still on hold.

@antoniovilela
Copy link
Contributor

unhold

@cmsbuild
Copy link
Contributor

cmsbuild commented Jul 2, 2024

This pull request is fully signed and it will be integrated in one of the next CMSSW_14_0_X IBs (tests are also fine) and once validation in the development release cycle CMSSW_14_1_X is complete. This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @rappoccio, @sextonkennedy (and backports should be raised in the release meeting by the corresponding L2)

@antoniovilela
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit d7876a0 into cms-sw:CMSSW_14_0_X Jul 2, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants