Add GPU support for ONNXRuntime #36963

Merged
merged 1 commit into from Apr 25, 2022
hqucms
Contributor

@hqucms hqucms commented Feb 14, 2022

PR description:

This is a technical PR that allows the GPU support of ONNXRuntime (cms-sw/cmsdist#6776) to be easily enabled in CMSSW.

Example usage:

  using namespace cms::Ort;
  std::string model_path = edm::FileInPath("PhysicsTools/ONNXRuntime/test/data/model.onnx").fullPath();
  auto session_options = ONNXRuntime::defaultSessionOptions(/*use_cuda=*/true);
  ONNXRuntime rt(model_path, &session_options);

PR validation:

Unit tests are updated to cover all three options.

@cmsbuild
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36963/28320

  • This PR adds an extra 20KB to repository

Code check has found code style and quality issues, which could be resolved by applying the following patch(es)

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36963/28323

  • This PR adds an extra 20KB to repository

@cmsbuild
Contributor

A new Pull Request was created by @hqucms (Huilin Qu) for master.

It involves the following packages:

  • PhysicsTools/ONNXRuntime (reconstruction)

@jpata, @cmsbuild, @clacaputo, @slava77 can you please review it and eventually sign? Thanks.
@riga this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@tvami
Contributor

tvami commented Feb 14, 2022

@cmsbuild , please test

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a42efb/22413/summary.html
COMMIT: 3ef28ff
CMSSW: CMSSW_12_3_X_2022-02-14-1100/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/36963/22413/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 5 differences found in the comparisons
  • DQMHistoTests: Total files compared: 46
  • DQMHistoTests: Total histograms compared: 3764435
  • DQMHistoTests: Total failures: 13
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3764399
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.004 KiB( 45 files compared)
  • DQMHistoSizes: changed ( 312.0 ): 0.004 KiB MessageLogger/Warnings
  • Checked 193 log files, 42 edm output root files, 46 DQM output files
  • TriggerResults: no differences found

@jpata
Contributor

jpata commented Feb 15, 2022

assign heterogeneous

@cmsbuild
Contributor

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@jpata
Contributor

jpata commented Feb 15, 2022

Neat! I'm curious:

  • how complicated would it be to extend to the Async mode in CMSSW?
  • now it grabs the whole GPU, how difficult would it be to extend to use streams?

  if (gpu_mode == force_gpu) {
    throw cms::Exception("RuntimeError") << "No GPU detected, cannot run ONNXRuntime on GPU.";
  } else {
    std::cout << "[ONNXRuntime] No GPU detected, will run on CPU." << std::endl;
Contributor

please don't use std::cout, as it cannot be silenced from the job configuration

Contributor Author

Would edm::LogInfo work?

Contributor

@fwyzard fwyzard Feb 15, 2022

Sure.

@hqucms
Contributor Author

hqucms commented Feb 15, 2022

  • how complicated would it be to extend to the Async mode in CMSSW?

I have never tried to use the async mode in CMSSW -- can you point me to any documentation or example?

  • now it grabs the whole GPU, how difficult would it be to extend to use streams?

It does not grab the whole GPU, so in principle it can be used in multiple streams simultaneously, I think (to be tested, of course).

Comment on lines +135 to +119
  if (input_dims[0] != batch_size) {
    throw cms::Exception("RuntimeError") << "The first element of `input_shapes` (" << input_dims[0]
                                         << ") does not match the given `batch_size` (" << batch_size << ")";
  }
Contributor

This change is unrelated to the use of GPUs, isn't it?

Contributor Author

No, it's unrelated to the GPU -- just a helpful error message for users.
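
As a standalone illustration of the check discussed in this thread, here is a minimal sketch that runs without CMSSW. It assumes `std::runtime_error` in place of `cms::Exception`, and `checkBatchSize` is a hypothetical helper name; the empty-shape guard is an addition for safety in this sketch, not something shown in the PR.

```cpp
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <vector>

// Hypothetical helper: verify that the leading dimension of an input shape
// equals the batch size the caller asked for.
void checkBatchSize(const std::vector<std::int64_t>& input_dims, std::int64_t batch_size) {
  if (input_dims.empty()) {
    // Guard added for this sketch so that input_dims[0] is always valid.
    throw std::invalid_argument("input shape must have at least one dimension");
  }
  if (input_dims[0] != batch_size) {
    std::ostringstream msg;
    msg << "The first element of `input_shapes` (" << input_dims[0]
        << ") does not match the given `batch_size` (" << batch_size << ")";
    throw std::runtime_error(msg.str());
  }
}
```

With this helper, `checkBatchSize({2, 3}, 2)` returns normally, while `checkBatchSize({4, 3}, 2)` throws with a message that names both values, so the user sees which of the two is wrong.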

@fwyzard
Contributor

fwyzard commented Feb 15, 2022

What NVIDIA GPU architectures (SM 6.0, 7.0, 7.5, etc.) does ONNX support?
Is it tied to how we use CUDA in CMSSW, or is it independent of it?

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36963/29454

  • This PR adds an extra 16KB to repository

@cmsbuild
Contributor

Pull request #36963 was updated. @makortel, @slava77, @clacaputo, @cmsbuild, @fwyzard, @jpata can you please check and sign again.

@hqucms
Contributor Author

hqucms commented Apr 22, 2022

@makortel The enum class is implemented in 5eb31fb.
Sorry for the long delay, this somehow slipped through the cracks...

@tvami
Contributor

tvami commented Apr 22, 2022

@cmsbuild , please test

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a42efb/24133/summary.html
COMMIT: 5eb31fb
CMSSW: CMSSW_12_4_X_2022-04-22-1100/slc7_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/36963/24133/install.sh to create a dev area with all the needed externals and cmssw changes.

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 24 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19874
  • DQMHistoTests: Total failures: 1195
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 18679
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

Comparison Summary

@slava77 comparisons for the following workflows were not done due to missing matrix map:

  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-a42efb/39434.75_TTbar_14TeV+2026D88_HLT75e33+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTrigger+RecoGlobal+HLT75e33+HARVESTGlobal

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3695434
  • DQMHistoTests: Total failures: 2
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3695410
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 205 log files, 45 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@makortel
Contributor

+1

Thanks @hqucms

@jpata
Contributor

jpata commented Apr 25, 2022

+reconstruction

  • technical, readiness for CUDA use in ONNXRuntime
  • no changes

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@qliphy
Contributor

qliphy commented Apr 25, 2022

+1

10 participants