
Enable Address Sanitizer in CI #19073

Merged: 62 commits, Jan 12, 2024


@snnn (Member) commented Jan 10, 2024

Description

  1. Add two build jobs that enable Address Sanitizer in CI: one for Windows CPU and one for Linux CPU.
  2. Set default compiler and linker flags in build.py for normal Windows/Linux/macOS builds, so that compiler flags are controlled in a more centralized way.
  3. All Windows binaries in our official packages will be built with the "/PROFILE" linker flag. Symbols for onnxruntime.dll can be found on the Microsoft public symbol server.
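Item 2 describes keeping default flags in one place inside build.py. The sketch below illustrates that idea only. The `FlagSet` type, the `default_flags` and `to_cmake_args` helpers, and the `official_build` switch are hypothetical and do not come from build.py. The flags themselves are real compiler and linker options: `/fsanitize=address` and `/PROFILE` for MSVC, and `-fsanitize=address` and `-fno-omit-frame-pointer` for GCC/Clang. The `CMAKE_*_FLAGS` names are standard CMake variables.

```python
"""Hypothetical sketch: choosing default compiler/linker flags in one place."""
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FlagSet:
    cflags: list[str] = field(default_factory=list)
    ldflags: list[str] = field(default_factory=list)


def default_flags(platform: str, enable_asan: bool, official_build: bool = False) -> FlagSet:
    """Return default flags for "windows", "linux" or "macos" (illustrative values only)."""
    flags = FlagSet()
    if platform == "windows":
        if official_build:
            # /PROFILE is an MSVC linker option; the PR enables it for official packages.
            flags.ldflags.append("/PROFILE")
        if enable_asan:
            flags.cflags.append("/fsanitize=address")
    elif platform in ("linux", "macos"):
        if enable_asan:
            # GCC/Clang: keep frame pointers so ASan reports have usable stack traces.
            flags.cflags += ["-fsanitize=address", "-fno-omit-frame-pointer"]
            flags.ldflags.append("-fsanitize=address")
    else:
        raise ValueError(f"unsupported platform: {platform}")
    return flags


def to_cmake_args(flags: FlagSet) -> list[str]:
    """Translate a FlagSet into standard CMake cache-variable definitions."""
    cflags = " ".join(flags.cflags)
    ldflags = " ".join(flags.ldflags)
    return [
        f"-DCMAKE_C_FLAGS={cflags}",
        f"-DCMAKE_CXX_FLAGS={cflags}",
        f"-DCMAKE_EXE_LINKER_FLAGS={ldflags}",
        f"-DCMAKE_SHARED_LINKER_FLAGS={ldflags}",
    ]
```

For example, `to_cmake_args(default_flags("linux", enable_asan=True))` yields the `-D...` arguments that a CMake configure step would receive. Because every platform's defaults live in one function, each decision is made in a single place rather than being spread across CMake files.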

Limitations:

  1. On Linux, Address Sanitizer ignores the RPATH settings in ELF binaries. Therefore, once Address Sanitizer is enabled, we must set LD_LIBRARY_PATH manually before running tests. Otherwise, libonnxruntime.so may fail to find custom ops and shared EPs.
  2. On Linux, we also need to set LD_PRELOAD before running some tests if the main executable (such as python) is not built with Address Sanitizer. This is not needed on Windows.
  3. On Windows, before running Python tests we must manually copy the Address Sanitizer DLL to the onnxruntime/capi directory. Python 3.8 and above enable "Safe DLL Search Mode", which does not use the directories listed in the PATH environment variable.
  4. On Linux, Address Sanitizer found many memory leaks in our Python binding code. For now, we therefore cannot enable Address Sanitizer when building ONNX Runtime with the Python binding.
  5. Address Sanitizer itself uses a lot of virtual address space and delays memory deallocations, which easily causes OOM issues in 32-bit applications. For this reason, we cannot run all the tests in onnxruntime_test_all in a single 32-bit process with Address Sanitizer. We can still run the tests individually; we just cannot run all of them in one process.
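Limitations 1, 2, 3 and 5 all describe extra setup work before tests can run. The sketch below shows one way to handle it. All helper names are hypothetical; this is not the code in the PR. It relies on a few real tools and behaviors:

* `gcc -print-file-name=libasan.so` asks GCC where its ASan runtime is. If GCC cannot find the file, it prints just the bare name.
* `--gtest_list_tests` and `--gtest_filter` are standard googletest options.
* The Windows DLL name used in the example is an assumption. On x64 MSVC toolchains, the ASan runtime is typically `clang_rt.asan_dynamic-x86_64.dll`.

```python
"""Hypothetical helpers for running ASan-instrumented tests (sketch, not build.py)."""
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path


def find_libasan(compiler: str = "gcc") -> str:
    """Ask GCC where its ASan runtime lives (used as the LD_PRELOAD value)."""
    result = subprocess.run(
        [compiler, "-print-file-name=libasan.so"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def linux_asan_env(base_env: dict[str, str], lib_dirs: list[str],
                   libasan: str | None = None) -> dict[str, str]:
    """Copy base_env and add LD_LIBRARY_PATH (and LD_PRELOAD if libasan is given)."""
    env = dict(base_env)
    # Limitation 1: RPATH is ignored, so list the library directories explicitly.
    paths = list(lib_dirs)
    if env.get("LD_LIBRARY_PATH"):
        paths.append(env["LD_LIBRARY_PATH"])
    env["LD_LIBRARY_PATH"] = ":".join(paths)
    # Limitation 2: preload the runtime when the host executable (e.g. python) is not instrumented.
    if libasan:
        env["LD_PRELOAD"] = libasan
    return env


def copy_asan_dll(asan_dll: Path, capi_dir: Path) -> Path:
    """Limitation 3 (Windows): place the ASan runtime DLL next to the Python extension."""
    capi_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(asan_dll, capi_dir))


def gtest_names(list_output: str) -> list[str]:
    """Parse `--gtest_list_tests` output into full "Suite.Test" names."""
    names: list[str] = []
    suite = ""
    for raw in list_output.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop "# GetParam() = ..." comments
        if not line:
            continue
        if raw.startswith(" "):
            names.append(suite + line.strip())
        else:
            suite = line.strip()  # suite lines end with "."
    return names


def run_each_test(test_exe: str, env: dict[str, str]) -> list[str]:
    """Limitation 5: run every test in its own process; return the names that failed."""
    listing = subprocess.run(
        [test_exe, "--gtest_list_tests"],
        capture_output=True, text=True, check=True, env=env,
    ).stdout
    failed = []
    for name in gtest_names(listing):
        if subprocess.run([test_exe, f"--gtest_filter={name}"], env=env).returncode != 0:
            failed.append(name)
    return failed
```

Running one process per test trades speed for a bounded address space. Each process starts fresh, so the memory that Address Sanitizer holds back from reuse never accumulates across the whole suite.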

Motivation and Context

To catch memory errors, such as leaks, buffer overflows, and use-after-free, in CI.

@snnn marked this pull request as ready for review January 10, 2024 06:53
@snnn merged commit 0e8d4c3 into main Jan 12, 2024
168 of 176 checks passed
@snnn deleted the snnn/santi branch January 12, 2024 15:24
snnn added a commit that referenced this pull request Jan 14, 2024
…EP (#19124)

### Description
Set default flags for nvcc, and do not set the flags for the ROCM EP.

### Motivation and Context
1. To meet a BinSkim requirement for the CUDA EP.

https://github.com/microsoft/binskim/blob/main/docs/BinSkimRules.md#rule-BA2024EnableSpectreMitigations

2. The ROCM EP's pipeline has been broken since PR #19073. Unit tests failed
to load the EP with the following error message:

Failed to load library libonnxruntime_providers_rocm.so with error:
/build/Release/libonnxruntime_providers_rocm.so: undefined symbol:
vtable for onnxruntime::InsertMaxPoolOutput

This PR is a hotfix to bring the pipeline back. So far I don't know why
the error happened. The symbol "InsertMaxPoolOutput" is in
onnxruntime_optimizers, and I don't see any EP code that references it directly.
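The #19124 commit message says that it sets default flags for nvcc and skips the flags for the ROCM EP, but it does not say how. The sketch below is a minimal illustration under several assumptions:

* The default flags are host-compiler (MSVC) options such as `/Qspectre`, the MSVC switch for Spectre mitigations that BinSkim rule BA2024 checks for.
* nvcc forwards each flag to the host compiler through its `-Xcompiler` option.
* "Do not set the flags for ROCM EP" means that ROCm builds skip the default flags entirely.

The function names are hypothetical. `CMAKE_CUDA_FLAGS` is the standard CMake variable for CUDA compile flags.

```python
"""Hypothetical sketch: forwarding default host-compiler flags to nvcc, skipping ROCm."""
from __future__ import annotations


def nvcc_host_flags(host_flags: list[str]) -> list[str]:
    """Wrap each host-compiler flag in -Xcompiler so nvcc passes it through."""
    wrapped: list[str] = []
    for flag in host_flags:
        wrapped += ["-Xcompiler", flag]
    return wrapped


def default_cmake_defines(host_flags: list[str], use_cuda: bool, use_rocm: bool) -> list[str]:
    """Build CMake definitions for the default flags (assumed behavior, see lead-in)."""
    if use_rocm:
        # Assumption: the hotfix leaves ROCm builds on CMake's own defaults.
        return []
    joined = " ".join(host_flags)
    defines = [f"-DCMAKE_C_FLAGS={joined}", f"-DCMAKE_CXX_FLAGS={joined}"]
    if use_cuda:
        # nvcc does not accept MSVC switches directly, so they go through -Xcompiler.
        defines.append(f"-DCMAKE_CUDA_FLAGS={' '.join(nvcc_host_flags(host_flags))}")
    return defines
```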
mszhanyi pushed a commit that referenced this pull request Jan 15, 2024
…EP (#19124)
snnn added a commit that referenced this pull request Jan 29, 2024
### Description

In PR #19073 I misunderstood the value of "--parallel". Instead of testing
whether args.parallel is None, I should test the value returned by the
number_of_parallel_jobs function.

If build.py is invoked without --parallel, args.parallel equals 1, because
that is the default value. In that case we should not add "/MP", but the
current code adds it, because `if args.parallel` evaluates to `if 1`, which
is True.
If build.py is invoked with --parallel but without a number, args.parallel
equals 0, because the job count is unspecified. In that case we should add
"/MP", but the current code does not, because `if args.parallel` evaluates
to `if 0`, which is False.
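The two cases above come from argparse's optional-argument behavior. With `nargs="?"`, omitting the option yields `default`, and passing it with no value yields `const`. The sketch below reproduces both cases and the corrected check. It makes several assumptions:

* `--parallel` is declared with `default=1` and `const=0`, as the commit text implies.
* A job count of 0 means "one job per CPU".
* "Test the returned value" means adding "/MP" only when more than one job is resolved.

`number_of_parallel_jobs` reuses the name from the commit text, but its body and the `cpu_count` parameter are illustrative. `make_parser` and `msvc_parallel_flags` are hypothetical.

```python
"""Sketch of the corrected /MP decision (names mirror the commit text; details assumed)."""
from __future__ import annotations

import argparse
import os


def make_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Assumed to match build.py: absent -> default 1, bare "--parallel" -> const 0.
    parser.add_argument("--parallel", nargs="?", const=0, default=1, type=int)
    return parser


def number_of_parallel_jobs(args: argparse.Namespace, cpu_count: int | None = None) -> int:
    """0 means "use every CPU"; any other value is taken as the job count."""
    if args.parallel == 0:
        return cpu_count or os.cpu_count() or 1
    return args.parallel


def msvc_parallel_flags(args: argparse.Namespace, cpu_count: int | None = None) -> list[str]:
    # Buggy version: `["/MP"] if args.parallel else []` adds /MP for the default 1
    # and omits it for a bare --parallel (0), the two cases described above.
    # Fixed version: decide from the resolved job count instead.
    return ["/MP"] if number_of_parallel_jobs(args, cpu_count) > 1 else []
```

For example, `msvc_parallel_flags(make_parser().parse_args(["--parallel"]), cpu_count=8)` returns `["/MP"]`, while parsing `[]` returns `[]`. The optional `cpu_count` parameter exists only to make the behavior testable without depending on the host machine.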

This also adds a new build flag, use_binskim_compliant_compile_flags, which is intended to be used only in the ONNX Runtime team's build pipelines for compliance reasons.

### Motivation and Context