problem following Huggingface example #714

Closed
saeid93 opened this issue Sep 4, 2022 · 7 comments

@saeid93 (Contributor) commented Sep 4, 2022

Using the following JSON from the Hugging Face example:

{
    "name": "transformer",
    "implementation": "mlserver_huggingface.HuggingFaceRuntime",
    "parallel_workers": 0,
    "parameters": {
        "extra": {
            "task": "text-generation",
            "pretrained_model": "distilgpt2",
            "optimum_model": true
        }
    }
}

resulted in the following error:

2022-09-04 16:07:24,722 [mlserver] INFO - Using asyncio event-loop policy: uvloop
2022-09-04 16:07:25.588056: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-09-04 16:07:25.588085: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
  File "/home/cc/miniconda3/envs/central/bin/mlserver", line 8, in <module>
    sys.exit(main())
  File "/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/mlserver/cli/main.py", line 79, in main
    root()
  File "/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/mlserver/cli/main.py", line 20, in wrapper
    return asyncio.run(f(*args, **kwargs))
  File "/home/cc/miniconda3/envs/central/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "uvloop/loop.pyx", line 1501, in uvloop.loop.Loop.run_until_complete
  File "/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/mlserver/cli/main.py", line 41, in start
    settings, models_settings = await load_settings(folder)
  File "/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/mlserver/cli/serve.py", line 36, in load_settings
    models_settings = await repository.list()
  File "/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/mlserver/repository.py", line 32, in list
    model_settings = self._load_model_settings(model_settings_path)
  File "/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/mlserver/repository.py", line 45, in _load_model_settings
    model_settings = ModelSettings.parse_file(model_settings_path)
  File "pydantic/main.py", line 564, in pydantic.main.BaseModel.parse_file
  File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj
  File "pydantic/env_settings.py", line 38, in pydantic.env_settings.BaseSettings.__init__
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for ModelSettings
implementation
  ensure this value contains valid import path or valid callable: cannot import name 'deepspeed_reinit' from 'transformers.deepspeed' (/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/transformers/deepspeed.py) (type=type_error.pyobject; error_message=cannot import name 'deepspeed_reinit' from 'transformers.deepspeed' (/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/transformers/deepspeed.py))

I couldn't figure out what is wrong with the pydantic validation; I copy-pasted the config from the README notebook.
Installed packages:

pip freeze | grep mlser
mlserver==1.1.0
mlserver-huggingface==1.1.0
@adriangonz (Contributor)

Hey @saeid93 ,

This has already been fixed in master (see #668), but it hasn't been released yet.

In the meantime, you can use the latest dev release, 1.2.0.dev6.
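For reference, upgrading locally would look something like this (a sketch, assuming both packages publish matching dev versions on PyPI):

pip install mlserver==1.2.0.dev6 mlserver-huggingface==1.2.0.dev6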

@yaliqin commented Sep 21, 2022

@adriangonz I ran into the same problem and changed the image to 1.2.0.dev6, but then I hit a liveness probe issue, as shown below:

Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  3m9s                default-scheduler  Successfully assigned seldon/gds-default-0-transformer-868bc94ff5-4t7bx to seldon-aitest-node4-t4
  Normal   Pulled     3m7s                kubelet            Successfully pulled image "dockerhub.paypalcorp.com/yalqin/init-container:0.3" in 127.477806ms
  Normal   Pulling    3m7s                kubelet            Pulling image "dockerhub.paypalcorp.com/yalqin/init-container:0.3"
  Normal   Created    3m6s                kubelet            Created container model-initializer
  Normal   Started    3m6s                kubelet            Started container model-initializer
  Normal   Created    2m21s               kubelet            Created container seldon-container-engine
  Normal   Pulled     2m21s               kubelet            Container image "seldon.io/mlserver:1.2.0.dev6-huggingface" already present on machine
  Normal   Created    2m21s               kubelet            Created container transformer
  Normal   Pulled     2m21s               kubelet            Container image "seldon.io/seldon-core-executor:1.14.0" already present on machine
  Normal   Started    2m21s               kubelet            Started container transformer
  Normal   Started    2m20s               kubelet            Started container seldon-container-engine
  Warning  Unhealthy  84s (x4 over 2m)    kubelet            Readiness probe failed: HTTP probe failed with statuscode: 400
  Warning  Unhealthy  69s (x7 over 104s)  kubelet            Readiness probe failed: Get "http://10.36.0.62:9000/v2/health/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  69s (x3 over 79s)   kubelet            Liveness probe failed: Get "http://10.36.0.62:9000/v2/health/live": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Normal   Killing    69s                 kubelet            Container transformer failed liveness probe, will be restarted

I did test the huggingface prepackaged server one or two months ago with the mlserver:1.1.0-huggingface image. At that time, everything went smoothly with my own transformer models.

@adriangonz (Contributor)

Hey @yaliqin,

This may be an issue unrelated to the deepspeed one. Could you share the logs from your model container?

@yaliqin commented Sep 26, 2022

Hi @adriangonz , the pod status is:

(base) yalqin@seldon-aitest-node1:~/seldon-models/test-huggingface$ kgp | grep gds
gds-default-0-transformer-7cd4c6d59f-svmxl   1/2   Running   1654   4d18h

The logs of the model container are:

(base) yalqin@seldon-aitest-node1:~/seldon-models/test-huggingface$ kl gds-default-0-transformer-7cd4c6d59f-svmxl transformer
Environment tarball not found at '/mnt/models/environment.tar.gz'
Environment not found at './environment'
Dotenv file not found at './.env'
2022-09-26 16:05:17,611 [mlserver] INFO - Using asyncio event-loop policy: uvloop
INFO:     Started server process [1]
INFO:     Waiting for application startup.
2022-09-26 16:05:19,286 [mlserver.metrics] INFO - Metrics server running on http://0.0.0.0:6000
2022-09-26 16:05:19,286 [mlserver.metrics] INFO - Prometheus scraping endpoint can be accessed on http://0.0.0.0:6000/prometheus
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
2022-09-26 16:05:19,290 [mlserver.grpc] INFO - gRPC server running on http://0.0.0.0:9500
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:9100 (Press CTRL+C to quit)
INFO:     Uvicorn running on http://0.0.0.0:6000 (Press CTRL+C to quit)
INFO:     10.36.0.0:33466 - "GET /v2/health/ready HTTP/1.1" 400 Bad Request
INFO:     10.36.0.0:33468 - "GET /v2/health/ready HTTP/1.1" 400 Bad Request
INFO:     10.36.0.0:33470 - "GET /v2/health/ready HTTP/1.1" 400 Bad Request

The log of the seldon-container-engine container is:
{"level":"error","ts":1664208504.7109072,"logger":"SeldonRestApi","msg":"Ready check failed","error":"dial tcp 127.0.0.1:9100: connect: connection refused","stacktrace":"net/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.handleCORSRequests.func1\n\t/workspace/api/rest/middlewares.go:64\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.CORSMethodMiddleware.func1.1\n\t/go/pkg/mod/github.com/gorilla/mux@v1.8.0/middleware.go:51\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.xssMiddleware.func1\n\t/workspace/api/rest/middlewares.go:87\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.(*CloudeventHeaderMiddleware).Middleware.func1\n\t/workspace/api/rest/middlewares.go:47\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.puidHeader.func1\n\t/workspace/api/rest/middlewares.go:79\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\t/go/pkg/mod/github.com/gorilla/mux@v1.8.0/mux.go:210\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2879\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1930"}

@adriangonz (Contributor) commented Sep 26, 2022

Hey @yaliqin ,

Thanks for sharing those.

Unfortunately, I can't see any reference there to the deepspeed error. I can see that the health endpoints are failing, but this is to be expected while the model is getting loaded.

You could try increasing the liveness probe timeouts; however, it does seem strange that the model is "just taking too long". Would you be able to open a separate issue so that we can dive further into this one?
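For anyone else hitting this, the probes on the model container can be relaxed through the SeldonDeployment's componentSpecs. Below is a minimal sketch only, assuming a deployment named gds with predictor default and a container named transformer (as in the events above); the thresholds are illustrative, and the port/path should match whatever your existing probes already use:

# Sketch: override the kubelet probes on the MLServer container so the
# model has time to download and load before the pod gets restarted.
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: gds
spec:
  predictors:
  - name: default
    graph:
      name: transformer
      # ... keep the existing graph / image configuration as-is ...
    componentSpecs:
    - spec:
        containers:
        - name: transformer
          livenessProbe:
            httpGet:
              path: /v2/health/live
              port: 9000
            initialDelaySeconds: 120  # illustrative values
            timeoutSeconds: 5
            failureThreshold: 10
          readinessProbe:
            httpGet:
              path: /v2/health/ready
              port: 9000
            initialDelaySeconds: 120
            timeoutSeconds: 5
            failureThreshold: 10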

@yaliqin commented Sep 26, 2022

Hi @adriangonz ,
Thanks for the response. I changed the liveness probe settings and it worked! Thank you very much.

@adriangonz (Contributor)

That's great @yaliqin! 🚀

Thanks a lot for the update! I'm glad you managed to sort it out.
