
Use Singularity for SonicTriton examples #11

Closed
kpedro88 opened this issue Sep 14, 2020 · 16 comments

@kpedro88
Collaborator

Currently, the SonicTriton examples in HeterogeneousCore/SonicTriton/test require Docker for standalone use (setting up a local server). Because Docker requires superuser permissions on Linux, it is preferable to use a Singularity container. An example of building a Singularity container for Triton can be found at lgray/triton-torchgeo-gat-example.
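
For reference, a minimal sketch of the kind of rootless invocation this would enable (the image tag and model-repository path are illustrative placeholders, not final choices):

```bash
# Sketch only: run the Triton server under Singularity directly from the
# Docker image, with no superuser permissions needed.
# The tag and bind paths below are placeholders.
singularity run --nv \
  -B $(pwd)/models:/models \
  docker://nvcr.io/nvidia/tritonserver:20.06-py3 \
  tritonserver --model-repository=/models
```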

Assigned to: @kpedro88, @lgray

@mialiu149

Tested on a UCSD machine, at least for tritonserver-20.06-v1-py3-geometric; I was still getting errors. Running with:

```
TMPDIR=/scratch/data/mliu/tmp singularity instance start \
  -B ./artifacts/models/:/models \
  --hostname gattestserver --writable \
  tritonserver-20.06-v1-py3-geometric/ gat_test_server
TMPDIR=/scratch/data/mliu/tmp singularity run --nv instance://gat_test_server \
  tritonserver --model-repository=/models >& gat_test_server.log &
sleep 2
TMPDIR=/scratch/data/mliu/tmp singularity run -B `pwd`/client:/inputs \
  --disable-cache docker://nvcr.io/nvidia/tritonserver:20.06-py3-clientsdk \
  python /inputs/client.py -m gat_test -u localhost:8001
```

[error]

```
tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
I0909 19:10:50.794980 21 metrics.cc:164] found 6 GPUs supporting NVML metrics
I0909 19:10:50.801559 21 metrics.cc:173] GPU 0: GeForce GTX 1080 Ti
I0909 19:10:50.808530 21 metrics.cc:173] GPU 1: GeForce GTX 1080 Ti
I0909 19:10:50.815280 21 metrics.cc:173] GPU 2: GeForce GTX 1080 Ti
I0909 19:10:50.822078 21 metrics.cc:173] GPU 3: GeForce GTX 1080 Ti
I0909 19:10:50.829149 21 metrics.cc:173] GPU 4: GeForce GTX 1080 Ti
I0909 19:10:50.836247 21 metrics.cc:173] GPU 5: GeForce GTX 1080 Ti
I0909 19:10:50.836589 21 server.cc:127] Initializing Triton Inference Server
E0909 19:10:52.002003 21 server.cc:168] failed to enable peer access for some device pairs
I0909 19:10:52.018406 21 server_status.cc:55] New status tracking for model 'gat_test'
I0909 19:10:52.018501 21 model_repository_manager.cc:723] loading: gat_test:1
I0909 19:10:52.026666 21 libtorch_backend.cc:232] Creating instance gat_test_0_gpu0 on GPU 0 (6.1) using model.pt
I0909 19:11:23.125336 21 libtorch_backend.cc:232] Creating instance gat_test_0_gpu1 on GPU 1 (6.1) using model.pt
I0909 19:11:53.259708 21 libtorch_backend.cc:232] Creating instance gat_test_0_gpu2 on GPU 2 (6.1) using model.pt
I0909 19:12:21.654204 21 libtorch_backend.cc:232] Creating instance gat_test_0_gpu3 on GPU 3 (6.1) using model.pt
I0909 19:12:50.234227 21 libtorch_backend.cc:232] Creating instance gat_test_0_gpu4 on GPU 4 (6.1) using model.pt
I0909 19:13:18.267221 21 libtorch_backend.cc:232] Creating instance gat_test_0_gpu5 on GPU 5 (6.1) using model.pt
I0909 19:13:46.825164 21 model_repository_manager.cc:888] successfully loaded 'gat_test' version 1
Starting endpoints, 'inference:0' listening on
I0909 19:13:46.828337 21 grpc_server.cc:1942] Started GRPCService at 0.0.0.0:8001
I0909 19:13:46.828415 21 http_server.cc:1428] Starting HTTPService at 0.0.0.0:8000
I0909 19:13:46.870846 21 http_server.cc:1443] Starting Metrics Service at 0.0.0.0:8002
```

And the server isn't running properly when I check with curl.

@kpedro88
Collaborator Author

@mialiu149 sleep 2 might not be long enough? Otherwise, can you clarify the specific error you observe? Most of this just looks like the standard log messages printed by the server.
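
For what it's worth, a sketch of replacing the fixed sleep with a poll of the readiness endpoint (this assumes the v2 HTTP API on the default port 8000; the retry count is arbitrary):

```bash
# Poll until the server reports ready instead of sleeping a fixed time.
# Assumes Triton v2 HTTP API on localhost:8000; 60 attempts, 1 s apart.
for i in $(seq 1 60); do
  if curl -sf localhost:8000/v2/health/ready > /dev/null; then
    echo "server ready (attempt ${i})"
    break
  fi
  sleep 1
done
```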

@lgray

lgray commented Sep 15, 2020

@mialiu149 are you testing it from a remote machine or on that same machine?

It looks like it's bound to 0.0.0.0 rather than an external-facing IP.

@mialiu149

So the log says that the server is running, and I also checked with Singularity:

```
$ singularity instance list
INSTANCE NAME     PID     IP   IMAGE
gat_test_server   23824        /scratch/data/mliu/triton-torchgeo-gat-example_singularity/tritonserver-20.06-v1-py3-geometric
```

If I check with curl:

```
$ curl -v localhost:8000/v2/health/ready
* About to connect() to localhost port 8000 (#0)
*   Trying ::1...
* Connection refused
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET /v2/health/ready HTTP/1.1
> User-Agent: curl/7.29.0
> Host: localhost:8000
> Accept: */*
>
< HTTP/1.1 400 Bad Request
< Content-Length: 0
< Content-Type: text/plain
<
* Connection #0 to host localhost left intact
```

Running a local client now also throws errors:

```
INFO:    Creating SIF file...
Traceback (most recent call last):
  File "/inputs/client.py", line 65, in <module>
    mconf = triton_client.get_model_config(model_name, as_json=True)
  File "/usr/local/lib/python3.6/dist-packages/tritongrpcclient/__init__.py", line 391, in get_model_config
    raise_error_grpc(rpc_error)
  File "/usr/local/lib/python3.6/dist-packages/tritongrpcclient/__init__.py", line 49, in raise_error_grpc
    raise get_error_grpc(rpc_error) from None
tritonclientutils.InferenceServerException: [StatusCode.UNIMPLEMENTED]
```

@lgray

lgray commented Sep 17, 2020

Ah!

tritonserver-20.06-v1-py3-geometric serves the Triton version 1 API, while the tests use the version 2 API. The v1 image is the one you want for interacting with CMSSW.

For testing with the Python scripts you want tritonserver-20.06-py3-geometric, which runs the Triton API v2 server.
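
If it helps to confirm which API a running server speaks, a quick probe of both health endpoints (to my knowledge v1 served /api/health/ready and v2 serves /v2/health/ready; treat the paths and port as assumptions):

```bash
# A v1 (TRTIS-style) server should answer 200 on /api/health/ready,
# a v2 server on /v2/health/ready; the other path returns an error code.
curl -s -o /dev/null -w "v1 /api/health/ready -> HTTP %{http_code}\n" \
  localhost:8000/api/health/ready
curl -s -o /dev/null -w "v2 /v2/health/ready -> HTTP %{http_code}\n" \
  localhost:8000/v2/health/ready
```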

@mialiu149

mialiu149 commented Sep 17, 2020 via email

@mialiu149

mialiu149 commented Sep 17, 2020 via email

@mialiu149

mialiu149 commented Sep 17, 2020 via email

@lgray

lgray commented Sep 17, 2020

On the first try it attempts to establish a connection over IPv6, which fails since the server isn't bound there; it then tries IPv4 and succeeds.

@lgray

lgray commented Sep 17, 2020

And yeah, do the Python tests with py3-geometric and the CMSSW tests with v1-py3-geometric. The torch-related libraries are the same between the two, so the models will work the same with either one; it's only the tritonserver API that's different.

@mialiu149

It worked with CMSSW from a remote client. Not sure why the server check with curl reported it unhealthy.

@lgray

lgray commented Sep 17, 2020

@mialiu149 it's succeeding, but on the second try: if you look at the logs you posted, for some reason it tries an IPv6 address first (which isn't bound) and fails, and then it tries 127.0.0.1:8000 and succeeds.

@lgray

lgray commented Sep 17, 2020

Try it again with `curl -4 -v localhost:8000/v2/health/ready`

@kpedro88
Collaborator Author

Progress on this issue:

  1. Followed https://github.com/lgray/triton-torchgeo-gat-example to build Docker container w/ PyTorch libraries
  2. Container now hosted on a FastML DockerHub account: https://hub.docker.com/repository/docker/fastml/triton-torchgeo
  3. Submitted PR to have Docker containers from that repo automatically converted to Singularity and hosted on cvmfs: https://gitlab.cern.ch/unpacked/sync/-/merge_requests/58 (the automatic singularity build command is very similar to @lgray's repo: https://github.com/cvmfs/cvmfs/blob/ff7728530936f3ef93bd5578cd9933bdc480be81/ducc/lib/image.go#L358)
  4. Combined all commands and options into a single script: cms-sw/cmssw@master...kpedro88:SonicTriton4

@lgray @mialiu149 let me know if you have any feedback before I submit the PR. The cmslpcgpu nodes are a good place to test both CPU and GPU modes (CPU mode can't be tested on normal cmslpc, because the AMD Opterons don't support AVX).
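
For anyone skimming, a simplified sketch of the general shape of such a wrapper (this is not the actual script from the branch above; the cvmfs image path, default paths, and option handling are all placeholders):

```bash
#!/bin/bash
# Illustrative sketch only, not the actual SonicTriton script.
# Starts a Triton server from an unpacked Singularity image on cvmfs,
# enabling GPU support only when a GPU is visible.
REPO=${1:-$PWD/models}   # model repository (placeholder default)
# Assumed cvmfs path layout for unpacked.cern.ch images:
IMAGE=/cvmfs/unpacked.cern.ch/registry.hub.docker.com/fastml/triton-torchgeo:20.06-py3-geometric
GPU_FLAG=""
if nvidia-smi -L > /dev/null 2>&1; then
  GPU_FLAG="--nv"
fi
singularity run ${GPU_FLAG} -B "${REPO}:/models" "${IMAGE}" \
  tritonserver --model-repository=/models
```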

@kpedro88
Collaborator Author

See: cms-sw/cmssw#31616

@kpedro88
Collaborator Author

kpedro88 commented Oct 7, 2020

Now merged.

@kpedro88 kpedro88 closed this as completed Oct 7, 2020