
Use Singularity for SonicTriton examples #11

Closed
kpedro88 opened this issue Sep 14, 2020 · 16 comments

@kpedro88
Collaborator

Currently, the SonicTriton examples in HeterogeneousCore/SonicTriton/test require Docker for standalone use (setting up a local server). Because Docker requires superuser permissions on Linux, it is preferable to use a Singularity container. An example of building a Singularity container for Triton can be found at lgray/triton-torchgeo-gat-example.
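
For reference, a minimal sketch of the kind of rootless invocation this would enable (the image tag and model-repository path are illustrative placeholders, not final choices):

```bash
# Sketch only: run the Triton server under Singularity directly from the
# Docker image, with no superuser permissions needed.
# The tag and bind paths below are placeholders.
singularity run --nv \
  -B $(pwd)/models:/models \
  docker://nvcr.io/nvidia/tritonserver:20.06-py3 \
  tritonserver --model-repository=/models
```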

Assigned to: @kpedro88, @lgray

@mialiu149

Tested on a UCSD machine, at least for tritonserver-20.06-v1-py3-geometric; I was still getting errors. Running with:

```
TMPDIR=/scratch/data/mliu/tmp singularity instance start \
  -B ./artifacts/models/:/models \
  --hostname gattestserver --writable \
  tritonserver-20.06-v1-py3-geometric/ gat_test_server
TMPDIR=/scratch/data/mliu/tmp singularity run --nv instance://gat_test_server \
  tritonserver --model-repository=/models >& gat_test_server.log &
sleep 2
TMPDIR=/scratch/data/mliu/tmp singularity run -B `pwd`/client:/inputs \
  --disable-cache docker://nvcr.io/nvidia/tritonserver:20.06-py3-clientsdk \
  python /inputs/client.py -m gat_test -u localhost:8001
```

[error]

```
tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
I0909 19:10:50.794980 21 metrics.cc:164] found 6 GPUs supporting NVML metrics
I0909 19:10:50.801559 21 metrics.cc:173] GPU 0: GeForce GTX 1080 Ti
I0909 19:10:50.808530 21 metrics.cc:173] GPU 1: GeForce GTX 1080 Ti
I0909 19:10:50.815280 21 metrics.cc:173] GPU 2: GeForce GTX 1080 Ti
I0909 19:10:50.822078 21 metrics.cc:173] GPU 3: GeForce GTX 1080 Ti
I0909 19:10:50.829149 21 metrics.cc:173] GPU 4: GeForce GTX 1080 Ti
I0909 19:10:50.836247 21 metrics.cc:173] GPU 5: GeForce GTX 1080 Ti
I0909 19:10:50.836589 21 server.cc:127] Initializing Triton Inference Server
E0909 19:10:52.002003 21 server.cc:168] failed to enable peer access for some device pairs
I0909 19:10:52.018406 21 server_status.cc:55] New status tracking for model 'gat_test'
I0909 19:10:52.018501 21 model_repository_manager.cc:723] loading: gat_test:1
I0909 19:10:52.026666 21 libtorch_backend.cc:232] Creating instance gat_test_0_gpu0 on GPU 0 (6.1) using model.pt
I0909 19:11:23.125336 21 libtorch_backend.cc:232] Creating instance gat_test_0_gpu1 on GPU 1 (6.1) using model.pt
I0909 19:11:53.259708 21 libtorch_backend.cc:232] Creating instance gat_test_0_gpu2 on GPU 2 (6.1) using model.pt
I0909 19:12:21.654204 21 libtorch_backend.cc:232] Creating instance gat_test_0_gpu3 on GPU 3 (6.1) using model.pt
I0909 19:12:50.234227 21 libtorch_backend.cc:232] Creating instance gat_test_0_gpu4 on GPU 4 (6.1) using model.pt
I0909 19:13:18.267221 21 libtorch_backend.cc:232] Creating instance gat_test_0_gpu5 on GPU 5 (6.1) using model.pt
I0909 19:13:46.825164 21 model_repository_manager.cc:888] successfully loaded 'gat_test' version 1
Starting endpoints, 'inference:0' listening on
I0909 19:13:46.828337 21 grpc_server.cc:1942] Started GRPCService at 0.0.0.0:8001
I0909 19:13:46.828415 21 http_server.cc:1428] Starting HTTPService at 0.0.0.0:8000
I0909 19:13:46.870846 21 http_server.cc:1443] Starting Metrics Service at 0.0.0.0:8002
```

And the server isn't running properly when I check with curl.

@kpedro88
Collaborator Author

@mialiu149 sleep 2 might not be long enough? Otherwise, can you clarify the specific error you observe? Most of this just looks like the standard log messages printed by the server.
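
For what it's worth, a sketch of replacing the fixed sleep with a poll of the readiness endpoint (this assumes the v2 HTTP API on the default port 8000; the retry count is arbitrary):

```bash
# Poll until the server reports ready instead of sleeping a fixed time.
# Assumes Triton v2 HTTP API on localhost:8000; 60 attempts, 1 s apart.
for i in $(seq 1 60); do
  if curl -sf localhost:8000/v2/health/ready > /dev/null; then
    echo "server ready (attempt ${i})"
    break
  fi
  sleep 1
done
```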

@lgray

lgray commented Sep 15, 2020

@mialiu149 are you testing it from a remote machine or on that same machine?

It looks like it's bound to 0.0.0.0 rather than an external-facing IP.

@mialiu149

So the log says that the server is running, and I also checked with Singularity:

```
$ singularity instance list
INSTANCE NAME     PID     IP   IMAGE
gat_test_server   23824        /scratch/data/mliu/triton-torchgeo-gat-example_singularity/tritonserver-20.06-v1-py3-geometric
```

If I check with curl:

```
$ curl -v localhost:8000/v2/health/ready
* About to connect() to localhost port 8000 (#0)
*   Trying ::1...
* Connection refused
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET /v2/health/ready HTTP/1.1
> User-Agent: curl/7.29.0
> Host: localhost:8000
> Accept: */*
>
< HTTP/1.1 400 Bad Request
< Content-Length: 0
< Content-Type: text/plain
<
* Connection #0 to host localhost left intact
```

Running a local client now also throws errors:

```
INFO:    Creating SIF file...
Traceback (most recent call last):
  File "/inputs/client.py", line 65, in <module>
    mconf = triton_client.get_model_config(model_name, as_json=True)
  File "/usr/local/lib/python3.6/dist-packages/tritongrpcclient/__init__.py", line 391, in get_model_config
    raise_error_grpc(rpc_error)
  File "/usr/local/lib/python3.6/dist-packages/tritongrpcclient/__init__.py", line 49, in raise_error_grpc
    raise get_error_grpc(rpc_error) from None
tritonclientutils.InferenceServerException: [StatusCode.UNIMPLEMENTED]
```

@lgray

lgray commented Sep 17, 2020

Ah!

tritonserver-20.06-v1-py3-geometric serves the Triton version 1 API, while the tests use the version 2 API. The v1 image is the one you want for interacting with CMSSW.

For testing with the Python scripts you want tritonserver-20.06-py3-geometric, which runs the Triton API v2 server.
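
If it helps to confirm which API a running server speaks, a quick probe of both health endpoints (to my knowledge v1 served /api/health/ready and v2 serves /v2/health/ready; treat the paths and port as assumptions):

```bash
# A v1 (TRTIS-style) server should answer 200 on /api/health/ready,
# a v2 server on /v2/health/ready; the other path returns an error code.
curl -s -o /dev/null -w "v1 /api/health/ready -> HTTP %{http_code}\n" \
  localhost:8000/api/health/ready
curl -s -o /dev/null -w "v2 /v2/health/ready -> HTTP %{http_code}\n" \
  localhost:8000/v2/health/ready
```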

@mialiu149

mialiu149 commented Sep 17, 2020 via email

@mialiu149

mialiu149 commented Sep 17, 2020 via email

@mialiu149

mialiu149 commented Sep 17, 2020 via email

@lgray

lgray commented Sep 17, 2020

On the first try it attempts to establish a connection over IPv6, which fails since the server isn't bound there; it then tries IPv4 and succeeds.

@lgray

lgray commented Sep 17, 2020

And yeah, do the Python tests with py3-geometric and the CMSSW tests with v1-py3-geometric. The torch-related libraries are the same between the two, so the models will work the same with either one; it's only the tritonserver API that's different.

@mialiu149

It worked with CMSSW from a remote client. Not sure why the server check with curl reported it unhealthy.

@lgray

lgray commented Sep 17, 2020

@mialiu149 it's succeeding, but on the second try: if you look at the logs you posted, for some reason it tries an IPv6 address first (which isn't bound) and fails, and then it tries 127.0.0.1:8000 and succeeds.

@lgray

lgray commented Sep 17, 2020

Try it again with `curl -4 -v localhost:8000/v2/health/ready`

@kpedro88
Collaborator Author

Progress on this issue:

  1. Followed https://github.com/lgray/triton-torchgeo-gat-example to build Docker container w/ PyTorch libraries
  2. Container now hosted on a FastML DockerHub account: https://hub.docker.com/repository/docker/fastml/triton-torchgeo
  3. Submitted PR to have Docker containers from that repo automatically converted to Singularity and hosted on cvmfs: https://gitlab.cern.ch/unpacked/sync/-/merge_requests/58 (the automatic singularity build command is very similar to @lgray's repo: https://github.com/cvmfs/cvmfs/blob/ff7728530936f3ef93bd5578cd9933bdc480be81/ducc/lib/image.go#L358)
  4. Combined all commands and options into a single script: cms-sw/cmssw@master...kpedro88:SonicTriton4

@lgray @mialiu149 let me know if you have any feedback before I submit the PR. The cmslpcgpu nodes are a good place to test both CPU and GPU modes (CPU mode can't be tested on normal cmslpc, because the AMD Opterons don't support AVX).
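
For anyone skimming, a simplified sketch of the general shape of such a wrapper (this is not the actual script from the branch above; the cvmfs image path, default paths, and option handling are all placeholders):

```bash
#!/bin/bash
# Illustrative sketch only, not the actual SonicTriton script.
# Starts a Triton server from an unpacked Singularity image on cvmfs,
# enabling GPU support only when a GPU is visible.
REPO=${1:-$PWD/models}   # model repository (placeholder default)
# Assumed cvmfs path layout for unpacked.cern.ch images:
IMAGE=/cvmfs/unpacked.cern.ch/registry.hub.docker.com/fastml/triton-torchgeo:20.06-py3-geometric
GPU_FLAG=""
if nvidia-smi -L > /dev/null 2>&1; then
  GPU_FLAG="--nv"
fi
singularity run ${GPU_FLAG} -B "${REPO}:/models" "${IMAGE}" \
  tritonserver --model-repository=/models
```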

@kpedro88
Collaborator Author

See: cms-sw/cmssw#31616

@kpedro88
Collaborator Author

kpedro88 commented Oct 7, 2020

Now merged.

@kpedro88 kpedro88 closed this as completed Oct 7, 2020