problem following Huggingface example #714

Closed
saeid93 opened this issue Sep 4, 2022 · 7 comments

@saeid93 (Contributor) commented Sep 4, 2022

Using the following JSON from the Hugging Face example:

{
    "name": "transformer",
    "implementation": "mlserver_huggingface.HuggingFaceRuntime",
    "parallel_workers": 0,
    "parameters": {
        "extra": {
            "task": "text-generation",
            "pretrained_model": "distilgpt2",
            "optimum_model": true
        }
    }
}

resulted in the following error:

2022-09-04 16:07:24,722 [mlserver] INFO - Using asyncio event-loop policy: uvloop
2022-09-04 16:07:25.588056: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-09-04 16:07:25.588085: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
  File "/home/cc/miniconda3/envs/central/bin/mlserver", line 8, in <module>
    sys.exit(main())
  File "/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/mlserver/cli/main.py", line 79, in main
    root()
  File "/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/mlserver/cli/main.py", line 20, in wrapper
    return asyncio.run(f(*args, **kwargs))
  File "/home/cc/miniconda3/envs/central/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "uvloop/loop.pyx", line 1501, in uvloop.loop.Loop.run_until_complete
  File "/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/mlserver/cli/main.py", line 41, in start
    settings, models_settings = await load_settings(folder)
  File "/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/mlserver/cli/serve.py", line 36, in load_settings
    models_settings = await repository.list()
  File "/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/mlserver/repository.py", line 32, in list
    model_settings = self._load_model_settings(model_settings_path)
  File "/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/mlserver/repository.py", line 45, in _load_model_settings
    model_settings = ModelSettings.parse_file(model_settings_path)
  File "pydantic/main.py", line 564, in pydantic.main.BaseModel.parse_file
  File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj
  File "pydantic/env_settings.py", line 38, in pydantic.env_settings.BaseSettings.__init__
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for ModelSettings
implementation
  ensure this value contains valid import path or valid callable: cannot import name 'deepspeed_reinit' from 'transformers.deepspeed' (/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/transformers/deepspeed.py) (type=type_error.pyobject; error_message=cannot import name 'deepspeed_reinit' from 'transformers.deepspeed' (/home/cc/miniconda3/envs/central/lib/python3.8/site-packages/transformers/deepspeed.py))

I couldn't figure out what is wrong with the pydantic validation; I copy-pasted the config from the README notebook.
Installed packages:

pip freeze | grep mlser
mlserver==1.1.0
mlserver-huggingface==1.1.0
@adriangonz (Contributor)

Hey @saeid93 ,

This has already been fixed in master (see #668), but it hasn't been released yet.

In the meantime, you can use the latest dev release, 1.2.0.dev6.
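For reference, upgrading locally would look something like this (a sketch, assuming both packages publish matching dev versions on PyPI):

pip install mlserver==1.2.0.dev6 mlserver-huggingface==1.2.0.dev6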

@yaliqin commented Sep 21, 2022

@adriangonz I ran into the same problem and changed the image to 1.2.0.dev6, but then I hit a liveness probe issue, as shown below:

Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  3m9s                default-scheduler  Successfully assigned seldon/gds-default-0-transformer-868bc94ff5-4t7bx to seldon-aitest-node4-t4
  Normal   Pulled     3m7s                kubelet            Successfully pulled image "dockerhub.paypalcorp.com/yalqin/init-container:0.3" in 127.477806ms
  Normal   Pulling    3m7s                kubelet            Pulling image "dockerhub.paypalcorp.com/yalqin/init-container:0.3"
  Normal   Created    3m6s                kubelet            Created container model-initializer
  Normal   Started    3m6s                kubelet            Started container model-initializer
  Normal   Created    2m21s               kubelet            Created container seldon-container-engine
  Normal   Pulled     2m21s               kubelet            Container image "seldon.io/mlserver:1.2.0.dev6-huggingface" already present on machine
  Normal   Created    2m21s               kubelet            Created container transformer
  Normal   Pulled     2m21s               kubelet            Container image "seldon.io/seldon-core-executor:1.14.0" already present on machine
  Normal   Started    2m21s               kubelet            Started container transformer
  Normal   Started    2m20s               kubelet            Started container seldon-container-engine
  Warning  Unhealthy  84s (x4 over 2m)    kubelet            Readiness probe failed: HTTP probe failed with statuscode: 400
  Warning  Unhealthy  69s (x7 over 104s)  kubelet            Readiness probe failed: Get "http://10.36.0.62:9000/v2/health/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  69s (x3 over 79s)   kubelet            Liveness probe failed: Get "http://10.36.0.62:9000/v2/health/live": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Normal   Killing    69s                 kubelet            Container transformer failed liveness probe, will be restarted

I did test the huggingface prepackaged server one or two months ago with the mlserver:1.1.0-huggingface image. At that time, everything went smoothly with my own transformer models.

@adriangonz (Contributor)

Hey @yaliqin,

This may be an issue unrelated to the deepspeed one. Could you share the logs from your model container?

@yaliqin commented Sep 26, 2022

Hi @adriangonz , the pod status is:

(base) yalqin@seldon-aitest-node1:~/seldon-models/test-huggingface$ kgp | grep gds
gds-default-0-transformer-7cd4c6d59f-svmxl   1/2   Running   1654   4d18h

The logs of the model container are:

(base) yalqin@seldon-aitest-node1:~/seldon-models/test-huggingface$ kl gds-default-0-transformer-7cd4c6d59f-svmxl transformer
Environment tarball not found at '/mnt/models/environment.tar.gz'
Environment not found at './environment'
Dotenv file not found at './.env'
2022-09-26 16:05:17,611 [mlserver] INFO - Using asyncio event-loop policy: uvloop
INFO:     Started server process [1]
INFO:     Waiting for application startup.
2022-09-26 16:05:19,286 [mlserver.metrics] INFO - Metrics server running on http://0.0.0.0:6000
2022-09-26 16:05:19,286 [mlserver.metrics] INFO - Prometheus scraping endpoint can be accessed on http://0.0.0.0:6000/prometheus
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
2022-09-26 16:05:19,290 [mlserver.grpc] INFO - gRPC server running on http://0.0.0.0:9500
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:9100 (Press CTRL+C to quit)
INFO:     Uvicorn running on http://0.0.0.0:6000 (Press CTRL+C to quit)
INFO:     10.36.0.0:33466 - "GET /v2/health/ready HTTP/1.1" 400 Bad Request
INFO:     10.36.0.0:33468 - "GET /v2/health/ready HTTP/1.1" 400 Bad Request
INFO:     10.36.0.0:33470 - "GET /v2/health/ready HTTP/1.1" 400 Bad Request

The log of the seldon-container-engine container is:
{"level":"error","ts":1664208504.7109072,"logger":"SeldonRestApi","msg":"Ready check failed","error":"dial tcp 127.0.0.1:9100: connect: connection refused","stacktrace":"net/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.handleCORSRequests.func1\n\t/workspace/api/rest/middlewares.go:64\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.CORSMethodMiddleware.func1.1\n\t/go/pkg/mod/github.com/gorilla/mux@v1.8.0/middleware.go:51\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.xssMiddleware.func1\n\t/workspace/api/rest/middlewares.go:87\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.(*CloudeventHeaderMiddleware).Middleware.func1\n\t/workspace/api/rest/middlewares.go:47\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.puidHeader.func1\n\t/workspace/api/rest/middlewares.go:79\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\t/go/pkg/mod/github.com/gorilla/mux@v1.8.0/mux.go:210\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2879\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1930"}

@adriangonz (Contributor) commented Sep 26, 2022

Hey @yaliqin ,

Thanks for sharing those.

Unfortunately, I can't see any reference there to the deepspeed error. I can see that the health endpoints are failing, but this is to be expected while the model is getting loaded.

You could try increasing the liveness probe timeouts; however, it does seem strange that the model is "just taking too long". Would you be able to open a separate issue so that we can dive further into this one?
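For anyone else hitting this, the probes on the model container can be relaxed through the SeldonDeployment's componentSpecs. Below is a minimal sketch only, assuming a deployment named gds with predictor default and a container named transformer (as in the events above); the thresholds are illustrative, and the port/path should match whatever your existing probes already use:

# Sketch: override the kubelet probes on the MLServer container so the
# model has time to download and load before the pod gets restarted.
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: gds
spec:
  predictors:
  - name: default
    graph:
      name: transformer
      # ... keep the existing graph / image configuration as-is ...
    componentSpecs:
    - spec:
        containers:
        - name: transformer
          livenessProbe:
            httpGet:
              path: /v2/health/live
              port: 9000
            initialDelaySeconds: 120  # illustrative values
            timeoutSeconds: 5
            failureThreshold: 10
          readinessProbe:
            httpGet:
              path: /v2/health/ready
              port: 9000
            initialDelaySeconds: 120
            timeoutSeconds: 5
            failureThreshold: 10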

@yaliqin commented Sep 26, 2022

Hi @adriangonz ,
Thanks for the response. I changed the liveness probe settings and it worked! Thank you very much.

@adriangonz (Contributor)

That's great @yaliqin! 🚀

Thanks a lot for the update! I'm glad you managed to sort it out.
