Run integration tests via CUDA Execution Provider #41

Merged 3 commits into TensorStack-AI:master on Nov 25, 2023

Conversation

@james-s-tayler commented Nov 20, 2023

This was an absolute mission to get working correctly, but I finally have the integration tests running in Docker via the CUDA execution provider :)

Some notes on this:

Initially I couldn't run OnnxStack on my Linux development machine as it would fail citing "The ONNX Runtime extensions library was not found". I tried lots of things to get it working in my local environment and couldn't, so I decided to file a bug report with OnnxRuntime themselves, and to do that I put together a minimal reproduction of the issue in Docker. To my surprise it actually worked in Docker without that error, so I implemented the first round of tests using just the CPU execution provider.
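
For reference, the CPU-only setup boils down to something like the following — a minimal sketch assuming the standard dotnet/sdk base image and a plain dotnet test invocation (image tag and paths are illustrative, not the exact Dockerfile in this PR):

```dockerfile
# Minimal shape of the CPU-only repro (image tag and paths are illustrative)
FROM mcr.microsoft.com/dotnet/sdk:7.0

WORKDIR /src
COPY . .
RUN dotnet build

# The CPU execution provider needs no GPU, so the tests can run as a build step
RUN dotnet test --no-build
```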

When it came time to try to get the tests running inside the container and using the GPU, I was tossing up whether it would be better to use the dotnet base image and install the drivers into the container (not desirable IMO) or to use the nvidia/cuda base image and either install the dotnet SDK into that or build a standalone executable. Along the way I discovered that while I didn't get the "The ONNX Runtime extensions library was not found" error in the dotnet base image in Docker, I did get it in the nvidia/cuda one!
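
The nvidia/cuda approach looks roughly like this — a sketch assuming a cudnn runtime base image and the Microsoft package feed for the SDK (the exact tag and install steps are assumptions, not necessarily what this PR ends up shipping):

```dockerfile
# Rough sketch of the nvidia/cuda-based image (base image tag and SDK install
# method are assumptions, not necessarily what this PR ships)
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu20.04

# Install the .NET 7 SDK from the Microsoft package feed
RUN apt-get update && apt-get install -y wget \
    && wget https://packages.microsoft.com/config/ubuntu/20.04/packages-microsoft-prod.deb \
    && dpkg -i packages-microsoft-prod.deb \
    && apt-get update && apt-get install -y dotnet-sdk-7.0

WORKDIR /src
COPY . .
RUN dotnet build

# The GPU is only visible at `docker run` time (via --gpus), not at build time,
# so the tests run as the container entrypoint rather than a RUN step
ENTRYPOINT ["dotnet", "test", "--no-build"]
```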

It was pretty tough to work out why, though. I tried every possible combination of shifting NuGet package references around and changing project settings in the .csproj, assuming it was simply a wrong setting or a package conflict somewhere. I knew it was a runtime issue and that it was failing to load the .so files, but after comparing the bin/Debug folders of the working version in the dotnet base image container with the two failing versions (my local dev environment and the nvidia/cuda base image container) I wasn't seeing any differences. So it had to be an environmental difference.

I'm not familiar with debugging issues calling into native code from .NET, so I asked GPT-4 for debugging strategies that could help with this vexing problem, and it recommended running ldd against the native binaries to reveal which shared libraries they depend on. Running that in my one working and two non-working environments revealed the following:

working dotnet/sdk based container

Step 17/18 : RUN ldd Tests/bin/Debug/net7.0/linux-x64/libortextensions.so
 ---> Running in b00a57c0bc62
    linux-vdso.so.1 (0x00007ffdb69d7000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ff07a2d4000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ff07a2b2000)
    libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007ff07a295000)
    libssl.so.1.1 => /usr/lib/x86_64-linux-gnu/libssl.so.1.1 (0x00007ff07a202000)
    libcrypto.so.1.1 => /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1 (0x00007ff079f0e000)
    libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007ff079d3f000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007ff079bfb000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007ff079be1000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ff079a0d000)
    /lib64/ld-linux-x86-64.so.2 (0x00007ff07a7c3000)

non-working local Pop_OS 23.04 development environment

me@pop-os:~/source/cuda-playground$ ldd Tests/bin/Debug/net7.0/linux-x64/libortextensions.so 
linux-vdso.so.1 (0x00007ffc025b4000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fc01766e000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fc017669000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fc01764d000)
libssl.so.1.1 => not found
libcrypto.so.1.1 => not found
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fc016c00000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fc017564000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fc017544000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fc016800000)
/lib64/ld-linux-x86-64.so.2 (0x00007fc017689000)

non-working nvidia/cuda based container

Step 17/18 : RUN ldd Tests/bin/Debug/net7.0/linux-x64/libortextensions.so
  ---> Running in a6d639d5ec48
    linux-vdso.so.1 (0x00007ffd577e2000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f014f36e000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f014f369000)
    libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f014f34d000)
    libssl.so.1.1 => not found
    libcrypto.so.1.1 => not found
    libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f014f11f000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f014f038000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f014f018000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f014edf0000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f014f85d000)

As you can see, the two non-working environments are missing the OpenSSL 1.1 libraries (libssl.so.1.1 and libcrypto.so.1.1)! Once I installed them into the container manually, the problem went away.
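
Concretely, the fix amounts to adding something like this to the nvidia/cuda image — a sketch assuming an Ubuntu 20.04-based image, where libssl1.1 is still available from the distro archive (newer bases need the package pulled in from elsewhere):

```dockerfile
# Install the OpenSSL 1.1 libraries that libortextensions.so links against
# (assumes an Ubuntu 20.04 base where libssl1.1 is still in the archive)
RUN apt-get update && apt-get install -y libssl1.1 \
    && rm -rf /var/lib/apt/lists/*

# Sanity check: libssl.so.1.1 and libcrypto.so.1.1 should now resolve
RUN ldd Tests/bin/Debug/net7.0/linux-x64/libortextensions.so
```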

Other things to note

  • I had to install nvidia-container-toolkit on my host system to get GPU passthrough to the containers working (see the sketch after this list).
  • When running watch -n1 nvidia-smi while the tests are running, nvidia-smi reports VRAM usage going as high as 23GB!
    • I'm wondering what the deal is with this?
    • It should be noted that all the tests run in sequence, not in parallel, and unloading the model at the end of each test doesn't seem to affect it.
  • I don't know if this runs on Windows, since nvidia-container-toolkit has to be installed on the host system, though presumably it works if you install it into WSL 2?
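
The GPU passthrough itself looks roughly like this once nvidia-container-toolkit is installed on the host (the image name is illustrative):

```bash
# Build the test image (name is illustrative)
docker build -t onnxstack-integration-tests .

# --gpus all passes the host GPU(s) through to the container
docker run --rm --gpus all onnxstack-integration-tests

# In another terminal on the host, watch VRAM usage while the tests run
watch -n1 nvidia-smi
```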

All in all the tests run quite a bit faster :)

@saddam213 (Member) commented Nov 20, 2023

So CUDA works on Windows if you install CUDA 11 and the toolkit; however, the VRAM usage is 2x what DirectML uses, which it looks like you have confirmed on Linux.

Using F16 models you can get the VRAM usage down to about 11GB, but the model load time was about 40-50 seconds on Windows, not 2-3 seconds like DirectML.

My initial tests suggested CUDA may be 10-20% faster; however, the VRAM usage and load delays make this meaningless IMO.

I did not investigate much further as it seems DOA to me

@saddam213 saddam213 merged commit 098b758 into TensorStack-AI:master Nov 25, 2023