initContainer tfserving-model-initializer is not able to pull model from s3 #3330

Closed
mukul-shaunik opened this issue Jun 25, 2021 · 8 comments

@mukul-shaunik

Describe the bug

When a SeldonDeployment is created with Istio sidecar injection enabled and mTLS in strict mode, the initContainer tfserving-model-initializer is not able to pull the model from S3.
There is a known issue on the Istio side that init containers cannot make outbound calls:
istio/istio#11130

To reproduce

  • Enable Istio sidecar injection in the namespace in which the SeldonDeployment is to be created.
  • Use the MinIO client to copy the model to the required S3 bucket.
  • When creating the SeldonDeployment, provide the model URI of the S3 bucket that holds the model.

Expected behaviour

The tfserving-model-initializer initContainer should be able to fetch the model from the S3 URI.

Environment

  • Cloud Provider: BareMetal
  • Kubernetes Cluster Version
    Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.5", GitCommit:"e338cf2c6d297aa603b50ad3a301f761b4173aa6", GitTreeState:"clean", BuildDate:"2020-12-09T11:18:51Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.5", GitCommit:"e338cf2c6d297aa603b50ad3a301f761b4173aa6", GitTreeState:"clean", BuildDate:"2020-12-09T11:10:32Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
  • Deployed Seldon System Images: [Output of kubectl get --namespace seldon-system deploy seldon-controller-manager -o yaml | grep seldonio] docker.io/seldonio/seldon-core-operator:1.6.0

Model Details

  • Images of your model: [Output of: kubectl get seldondeployment -n <yourmodelnamespace> <seldondepname> -o yaml | grep image: where <yourmodelnamespace>]

seldonio/tfserving-proxy:1.6.0

tensorflow/serving:2.1.0

  • Logs of your model:
[W 210625 10:03:55 connectionpool:780] Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1ceb11c0a0>: Failed to establish a new connection: [Errno 111] Connection refused')': /tfserving?location=
[W 210625 10:03:55 connectionpool:780] Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1ceb11c250>: Failed to establish a new connection: [Errno 111] Connection refused')': /tfserving?location=
[W 210625 10:03:56 connectionpool:780] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1ceb11c400>: Failed to establish a new connection: [Errno 111] Connection refused')': /tfserving?location=
[W 210625 10:03:57 connectionpool:780] Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1ceb11c5b0>: Failed to establish a new connection: [Errno 111] Connection refused')': /tfserving?location=
[W 210625 10:04:01 connectionpool:780] Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1ceb11c760>: Failed to establish a new connection: [Errno 111] Connection refused')': /tfserving?location=
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 169, in _new_conn
    conn = connection.create_connection(
  File "/usr/local/lib/python3.9/site-packages/urllib3/util/connection.py", line 96, in create_connection
    raise err
  File "/usr/local/lib/python3.9/site-packages/urllib3/util/connection.py", line 86, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 394, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 234, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/local/lib/python3.9/http/client.py", line 1253, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.9/http/client.py", line 1299, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.9/http/client.py", line 1248, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.9/http/client.py", line 1008, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.9/http/client.py", line 948, in send
    self.connect()
  File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 200, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 181, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f1ceb11c910>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "//a.py", line 2, in <module>
    kfserving.Storage.download("s3://tfserving/verA", "/tmp/m1")
  File "/usr/local/lib/python3.9/site-packages/kfserving/storage.py", line 62, in download
    Storage._download_s3(uri, out_dir)
  File "/usr/local/lib/python3.9/site-packages/kfserving/storage.py", line 89, in _download_s3
    for obj in objects:
  File "/usr/local/lib/python3.9/site-packages/minio/api.py", line 2294, in _list_objects
    response = self._url_open(
  File "/usr/local/lib/python3.9/site-packages/minio/api.py", line 2189, in _url_open
    region = self._get_bucket_region(bucket_name)
  File "/usr/local/lib/python3.9/site-packages/minio/api.py", line 2067, in _get_bucket_region
    region = self._get_bucket_location(bucket_name)
  File "/usr/local/lib/python3.9/site-packages/minio/api.py", line 2100, in _get_bucket_location
    response = self._http.urlopen(method, url,
  File "/usr/local/lib/python3.9/site-packages/urllib3/poolmanager.py", line 375, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 783, in urlopen
    return self.urlopen(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 783, in urlopen
    return self.urlopen(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 783, in urlopen
    return self.urlopen(
  [Previous line repeated 2 more times]
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.9/site-packages/urllib3/util/retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='x.x.x.x', port=9000): Max retries exceeded with url: /tfserving?location= (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1ceb11c910>: Failed to establish a new connection: [Errno 111] Connection refused'))
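
The traceback ends in a plain TCP-level "Connection refused", so it may help to separate the network problem from the storage client. The following is a minimal sketch, not part of Seldon or kfserving: `probe_endpoint` is a hypothetical helper that attempts a single TCP connection using only the Python standard library. Run from inside the init container, it would show whether the MinIO host and port (9000 in the log above) are reachable at all.

```python
import socket


def probe_endpoint(host: str, port: int, timeout: float = 3.0) -> str:
    """Try one TCP connection and report the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "reachable"
    except ConnectionRefusedError:
        # Same failure class as the [Errno 111] seen in the traceback.
        return "refused"
    except OSError as exc:
        # Timeouts, DNS failures, unreachable networks, and similar errors.
        return f"error: {exc}"


if __name__ == "__main__":
    # Placeholder address: substitute the real MinIO host and port.
    print(probe_endpoint("127.0.0.1", 9000))
```

If the probe reports "refused" in the init container but "reachable" from a regular container in the same pod, that would point at the init-container and sidecar ordering rather than at MinIO or the credentials.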
mukul-shaunik added the bug and triage (Needs to be triaged and prioritised accordingly) labels on Jun 25, 2021
@mukul-shaunik
Author

/cc @hemantha-kumara
/cc @narasago
/cc @nrchakradhar

@ukclivecox
Contributor

Have you tried with the latest seldon core that uses rclone?

ukclivecox removed the triage (Needs to be triaged and prioritised accordingly) label on Jul 29, 2021
@nrchakradhar

The root of the problem is that init containers do not work well with the Istio sidecar proxy. Is there any possibility in Seldon to have the models fetched remotely outside of init containers? If such an option is available, it can be used.
If it is not available, can this be taken as a request to add such a feature?

@axsaucedo
Contributor

It seems there's a suggested fix that uses Istio-specific annotations. @nrchakradhar, have you tried this? https://stackoverflow.com/questions/64356701/allow-requests-to-kubernetes-api-from-an-init-container-with-istio-cni-plugin
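
For readers unfamiliar with the annotation approach, here is a minimal sketch of what such annotations could look like. `traffic.sidecar.istio.io/excludeOutboundPorts` and `traffic.sidecar.istio.io/excludeOutboundIPRanges` are Istio pod annotations. The helper name `istio_exclusion_annotations` is hypothetical, port 9000 is taken from the log in the issue description, and where the annotations need to go in a SeldonDeployment so that they reach the pod is not confirmed in this thread. Excluded traffic bypasses the sidecar, so this may not suit a storage service that itself sits inside a strict mTLS mesh.

```python
import json
from typing import Dict, Iterable


def istio_exclusion_annotations(
    ports: Iterable[int] = (), ip_ranges: Iterable[str] = ()
) -> Dict[str, str]:
    """Build pod annotations asking Istio not to capture the given outbound traffic."""
    annotations: Dict[str, str] = {}
    port_list = [str(p) for p in ports]
    range_list = list(ip_ranges)
    if port_list:
        # Comma-separated list of outbound ports excluded from redirection.
        annotations["traffic.sidecar.istio.io/excludeOutboundPorts"] = ",".join(port_list)
    if range_list:
        # Comma-separated list of CIDR ranges excluded from redirection.
        annotations["traffic.sidecar.istio.io/excludeOutboundIPRanges"] = ",".join(range_list)
    return annotations


if __name__ == "__main__":
    # Assumption: MinIO listens on port 9000, as in the traceback.
    print(json.dumps(istio_exclusion_annotations(ports=[9000]), indent=2))
```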

@nrchakradhar

@axsaucedo Thanks for the reference. The exclusion option has not been an encouraging solution. If model fetching could be done outside of the init container, it would be useful. We are using Seldon without Istio in certain configurations.
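
To illustrate the request, fetching outside the init container could look roughly like the sketch below. This is not an existing Seldon feature. `fetch_when_reachable` is a hypothetical helper, and the `fetch` callable stands in for whatever download routine is used (for example the `kfserving.Storage.download` call from the traceback). The idea is that a regular container starts alongside the sidecar, polls until the storage endpoint accepts TCP connections, and only then downloads the model.

```python
import socket
import time
from typing import Callable


def fetch_when_reachable(
    host: str,
    port: int,
    fetch: Callable[[], None],
    attempts: int = 10,
    delay: float = 2.0,
) -> bool:
    """Poll host:port until it accepts a TCP connection, then call fetch() once.

    Returns True if fetch() was called, False if every attempt failed.
    """
    for _ in range(attempts):
        try:
            # Open and immediately close a connection as a reachability check.
            with socket.create_connection((host, port), timeout=delay):
                pass
        except OSError:
            # Endpoint not reachable yet (for example, the proxy is still starting).
            time.sleep(delay)
            continue
        fetch()
        return True
    return False
```

A caller would pass its own download function, for example `fetch_when_reachable("minio.example", 9000, download_model)`, where both the host name and `download_model` are placeholders.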

@RafalSkolasinski
Contributor

RafalSkolasinski commented Dec 21, 2021

We will need more information about the Istio configuration that is causing the issue; the mere presence of sidecars is not enough.

I have just deployed

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: iris
  namespace: sidecars
spec:
  predictors:
  - name: default
    replicas: 1
    graph:
      name: classifier
      implementation: SKLEARN_SERVER
      modelUri: gs://seldon-models/v1.11.2/sklearn/iris

in a namespace with sidecar injection enabled:

kubectl label namespace sidecars istio-injection=enabled --overwrite

without any issues.

I observed that the rclone init container executed before istio-init and completed without any issues.

Seldon Core: 1.11.2
k8s: 1.20.11
Istio: 1.9.5

@ukclivecox
Contributor

Closing. Please reopen if this is still an issue.

@malikkirchner

The workaround only works if the model source is outside the service mesh, e.g. Google Cloud Storage. I am running a MinIO instance inside the service mesh and would like to pull models from there.
