Run integration tests via CUDA Execution Provider #41
This was an absolute mission to get working correctly, but I finally have the integration tests running in Docker via the CUDA execution provider :)
Some notes on this:
Initially I couldn't run OnnxStack on my Linux development machine, as it would fail citing `"The ONNX Runtime extensions library was not found"`. I tried lots of stuff to get it working in my local environment and couldn't, so I decided I would file a bug report with OnnxRuntime themselves, and to do so I would make a minimal reproduction of the issue in Docker. To my surprise it actually worked in Docker and I didn't get that error, so I implemented the first round of tests using just the CPU Execution Provider.

When it came time to get the tests running inside the container and using the GPU, I was tossing up between using the dotnet base image and installing the drivers into the container (not desirable IMO), or using the nvidia/cuda base image and either installing the dotnet SDK into that or building a standalone executable.
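For context, the standalone-executable option would have looked roughly like this. This is a sketch only; the project path and target runtime below are assumptions for illustration, not necessarily what ended up in this PR:

```bash
# Sketch: publish the test project as a self-contained executable so
# the nvidia/cuda image needs no dotnet SDK or runtime installed.
# The project path here is hypothetical.
dotnet publish Tests/OnnxStack.IntegrationTests.csproj \
  -c Release \
  -r linux-x64 \
  --self-contained true
```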
"The ONNX Runtime extensions library was not found"
error in the dotnet base image in Docker I did get it in the nvidia/cuda one!It was pretty tough to work out why though, and I tried every possible combination of shifting around NuGet package references, and changing project settings in the
.csproj
and whatnot assuming it was simply having difficulty due to a wrong setting or a package conflict somewhere. I knew that it was a runtime issue, and that it was failing to load the.so
files, but after comparing thebin/Debug
folders of the working version in the dotnet base image container, and the two failing versions in my local dev environment and the nvidia/cuda base image container I wasn't seeing any differences. So, that means it had to be an environmental difference.I'm not familiar with debugging issues calling into native code from .NET, so I asked GPT-4 for any debugging strategies that could help with this vexing problem, and it recommended running
ldd
against the native binaries to reveal what dependencies they need. So, I tried that between my 1 working, and 2 non-working environments and that revealed the following:working dotnet/sdk based container
non-working local Pop_OS 23.04 development environment
non-working nvidia/cuda based container
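For anyone else unfamiliar with debugging native loads from .NET, the check itself is a one-liner. The library path below is illustrative (it follows the usual NuGet runtimes layout); point `ldd` at whichever `.so` is failing to load:

```bash
# List the shared-library dependencies of the native binary;
# anything unresolved is printed as "=> not found".
ldd bin/Debug/net7.0/runtimes/linux-x64/native/libortextensions.so
```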
Comparing the `ldd` output, the two non-working environments were missing the dependency on the SSL 1.1 binaries! Once I installed these into the container manually, the problem went away.
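In case it helps anyone hitting the same error, installing them looked roughly like this. This assumes an Ubuntu 20.04-based image where `libssl1.1` is still in the default apt repositories; on newer Ubuntu bases the package is gone and you would need to pull the `.deb` from the Ubuntu archive instead:

```bash
# Install the OpenSSL 1.1 runtime that the ONNX Runtime native
# libraries link against (libssl.so.1.1 and libcrypto.so.1.1).
apt-get update && apt-get install -y libssl1.1
```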
**Other things to note:**

- I had to install `nvidia-container-toolkit` on my host system to get the GPU passthrough to the containers working (rough setup sketched in the P.S. below).
- Watching `watch -n1 nvidia-smi` while running the tests, `nvidia-smi` reports VRAM usage going as high as 23GB!
- I'm not sure about Windows, where you can't install `nvidia-container-toolkit` into the host system, though presumably it works if you install that into WSL-2???

All in all the tests run quite a bit faster :)
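P.S. for anyone reproducing the GPU passthrough setup from the first note above, the host-side steps were roughly the following. The package name and `nvidia-ctk` command are from NVIDIA's container toolkit docs; the apt repository setup step varies by distro and is omitted here, and the image name is a placeholder:

```bash
# Install NVIDIA's container toolkit on the host so Docker can hand
# the GPU through to containers (distro repo setup omitted).
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Run the test container with GPU access, and watch VRAM usage from
# the host in another terminal while the tests run.
docker run --rm --gpus all <your-test-image>
watch -n1 nvidia-smi
```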