Deployment issue on AWS #3077

Closed
dimasheva1 opened this issue Mar 19, 2021 · 4 comments

Comments

@dimasheva1

Describe the bug

I get a Whitelabel Error Page when I open the Ambassador endpoint of my model on AWS (the Istio endpoint doesn't work because Seldon Core cannot be deployed with it). The endpoint http://localhost:9000/api/v1.0/predictions works fine for prediction locally via a POST request.
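
For reference, this is roughly the local request that works (a minimal sketch assuming the standard Seldon Core REST payload; the feature vector is a hypothetical placeholder, not the real Titanic input schema):

    # Local prediction request against the wrapped model (sketch).
    # Assumes the model server from the Dockerfile below is running on port 9000.
    import requests

    payload = {"data": {"ndarray": [[3, 22.0, 1, 0, 7.25]]}}  # hypothetical features
    resp = requests.post(
        "http://localhost:9000/api/v1.0/predictions",
        json=payload,
        timeout=10,
    )
    print(resp.status_code, resp.json())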

To reproduce

  1. Create the Python wrapper (a minimal sketch follows this list).
  2. Build the Docker image (FROM seldonio/seldon-core-s2i-python3:1.7.0-dev; same behaviour with python:3.8-slim).
  3. Deploy on AWS with Ambassador ingress:
  • kubectl create namespace seldon-system
  • helm install seldon-core seldon-core-aws --repo https://storage.googleapis.com/seldon-aws-charts --version 0.5.0 --set usageMetrics.enabled=true --namespace seldon-system --set ambassador.enabled=true
  • kubectl create namespace ambassador || echo "namespace ambassador exists"
  • helm install ambassador datawire/ambassador
    --set image.repository=quay.io/datawire/ambassador
    --set enableAES=false
    --set crds.keep=false
    --namespace ambassador
  • Create the docker-registry secret regcred
  • kubectl create -f seldon-deploy.yaml
  4. Open the Ambassador endpoint: http://ae3f5ef3d71fc416485f0e9e83076db0-1449493934.us-west-2.elb.amazonaws.com/seldon/default/seldon-model/api/v1.0/doc
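
For reference, step 1 refers to a Seldon Python wrapper. A minimal sketch is below; the class name must match MODEL_NAME in the Dockerfile (Titanic), while the model loading and predict logic shown here are assumptions for illustration only:

    # Titanic.py — minimal Seldon Python wrapper sketch (hypothetical body).
    import numpy as np

    class Titanic:
        def __init__(self):
            # Hypothetical: load the trained model artifact here.
            self.model = None

        def predict(self, X, features_names=None):
            # Hypothetical pass-through; the real wrapper would call the
            # loaded TensorFlow model, e.g. self.model.predict(X).
            return np.asarray(X)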

Expected behaviour

See the standardized user interface (the Swagger/OpenAPI documentation page) for the model.

Environment

  • Cloud Provider: AWS
  • Kubernetes Cluster Version: 1.18
  • Deployed Seldon System Images:
    value: 403495124976.dkr.ecr.us-east-1.amazonaws.com/cc92bd08-3aee-4006-983a-b74fbf1cbfa8/cg-2585947346/seldonio/engine:0.5.0-latest
    image: 403495124976.dkr.ecr.us-east-1.amazonaws.com/cc92bd08-3aee-4006-983a-b74fbf1cbfa8/cg-2585947346/seldonio/seldon-core-operator:0.5.0-latest

Model Details

  • Images of your model:
  • Dockerfile:
    FROM seldonio/seldon-core-s2i-python3:1.7.0-dev
    COPY . /app
    WORKDIR /app
    RUN pip install -r requirements.txt
    EXPOSE 5000
    EXPOSE 6000
    EXPOSE 9000
    ENV MODEL_NAME Titanic
    ENV SERVICE_TYPE MODEL
    ENV PERSISTENCE 0
    CMD exec seldon-core-microservice $MODEL_NAME --service-type $SERVICE_TYPE --persistence $PERSISTENCE
  • Seldon deployment yaml:
    apiVersion: machinelearning.seldon.io/v1alpha2
    kind: SeldonDeployment
    metadata:
      name: seldon-model
    spec:
      name: test-deployment
      predictors:
      - componentSpecs:
        - spec:
            containers:
            - name: classifier
              image: dimasheva1/seldon:latest
            imagePullSecrets:
            - name: regcred
        graph:
          children: []
          endpoint:
            type: GRPC
          name: classifier
          type: MODEL
        name: example
        replicas: 1
  • Logs of your model:
  • classifier:
    2021-03-19 09:34:56.922094: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
    2021-03-19 09:34:56.922143: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
    2021-03-19 09:34:59,344 - seldon_core.microservice:main:203 - INFO: Starting microservice.py:main
    2021-03-19 09:34:59,344 - seldon_core.microservice:main:204 - INFO: Seldon Core version: 1.7.0-dev
    2021-03-19 09:34:59,347 - seldon_core.microservice:main:332 - INFO: Parse JAEGER_EXTRA_TAGS []
    2021-03-19 09:34:59,347 - seldon_core.microservice:load_annotations:155 - INFO: Found annotation kubernetes.io/config.seen:2021-03-19T09:34:08.084412441Z
    2021-03-19 09:34:59,347 - seldon_core.microservice:load_annotations:155 - INFO: Found annotation kubernetes.io/config.source:api
    2021-03-19 09:34:59,347 - seldon_core.microservice:load_annotations:155 - INFO: Found annotation kubernetes.io/psp:eks.privileged
    2021-03-19 09:34:59,347 - seldon_core.microservice:load_annotations:155 - INFO: Found annotation prometheus.io/path:prometheus
    2021-03-19 09:34:59,347 - seldon_core.microservice:load_annotations:155 - INFO: Found annotation prometheus.io/port:8000
    2021-03-19 09:34:59,347 - seldon_core.microservice:load_annotations:155 - INFO: Found annotation prometheus.io/scrape:true
    2021-03-19 09:34:59,347 - seldon_core.microservice:main:335 - INFO: Annotations: {'kubernetes.io/config.seen': '2021-03-19T09:34:08.084412441Z', 'kubernetes.io/config.source': 'api', 'kubernetes.io/psp': 'eks.privileged', 'prometheus.io/path': 'prometheus', 'prometheus.io/port': '8000', 'prometheus.io/scrape': 'true'}
    2021-03-19 09:34:59,347 - seldon_core.microservice:main:339 - INFO: Importing Titanic
    2021-03-19 09:34:59,374 - seldon_core.microservice:main:422 - INFO: REST gunicorn microservice running on port 9000
    2021-03-19 09:34:59,376 - seldon_core.microservice:main:476 - INFO: REST metrics microservice running on port 6000
    2021-03-19 09:34:59,376 - seldon_core.microservice:main:486 - INFO: Starting servers
    2021-03-19 09:34:59,392 - seldon_core.wrapper:_set_flask_app_configs:213 - INFO: App Config: <Config {'ENV': 'production', 'DEBUG': False, 'TESTING': False, 'PROPAGATE_EXCEPTIONS': None, 'PRESERVE_CONTEXT_ON_EXCEPTION': None, 'SECRET_KEY': None, 'PERMANENT_SESSION_LIFETIME': datetime.timedelta(days=31), 'USE_X_SENDFILE': False, 'SERVER_NAME': None, 'APPLICATION_ROOT': '/', 'SESSION_COOKIE_NAME': 'session', 'SESSION_COOKIE_DOMAIN': None, 'SESSION_COOKIE_PATH': None, 'SESSION_COOKIE_HTTPONLY': True, 'SESSION_COOKIE_SECURE': False, 'SESSION_COOKIE_SAMESITE': None, 'SESSION_REFRESH_EACH_REQUEST': True, 'MAX_CONTENT_LENGTH': None, 'SEND_FILE_MAX_AGE_DEFAULT': datetime.timedelta(seconds=43200), 'TRAP_BAD_REQUEST_ERRORS': None, 'TRAP_HTTP_EXCEPTIONS': False, 'EXPLAIN_TEMPLATE_LOADING': False, 'PREFERRED_URL_SCHEME': 'http', 'JSON_AS_ASCII': True, 'JSON_SORT_KEYS': True, 'JSONIFY_PRETTYPRINT_REGULAR': False, 'JSONIFY_MIMETYPE': 'application/json', 'TEMPLATES_AUTO_RELOAD': None, 'MAX_COOKIE_SIZE': 4093}>
    2021-03-19 09:34:59,411 - seldon_core.wrapper:_set_flask_app_configs:213 - INFO: App Config: <Config {'ENV': 'production', 'DEBUG': False, 'TESTING': False, 'PROPAGATE_EXCEPTIONS': None, 'PRESERVE_CONTEXT_ON_EXCEPTION': None, 'SECRET_KEY': None, 'PERMANENT_SESSION_LIFETIME': datetime.timedelta(days=31), 'USE_X_SENDFILE': False, 'SERVER_NAME': None, 'APPLICATION_ROOT': '/', 'SESSION_COOKIE_NAME': 'session', 'SESSION_COOKIE_DOMAIN': None, 'SESSION_COOKIE_PATH': None, 'SESSION_COOKIE_HTTPONLY': True, 'SESSION_COOKIE_SECURE': False, 'SESSION_COOKIE_SAMESITE': None, 'SESSION_REFRESH_EACH_REQUEST': True, 'MAX_CONTENT_LENGTH': None, 'SEND_FILE_MAX_AGE_DEFAULT': datetime.timedelta(seconds=43200), 'TRAP_BAD_REQUEST_ERRORS': None, 'TRAP_HTTP_EXCEPTIONS': False, 'EXPLAIN_TEMPLATE_LOADING': False, 'PREFERRED_URL_SCHEME': 'http', 'JSON_AS_ASCII': True, 'JSON_SORT_KEYS': True, 'JSONIFY_PRETTYPRINT_REGULAR': False, 'JSONIFY_MIMETYPE': 'application/json', 'TEMPLATES_AUTO_RELOAD': None, 'MAX_COOKIE_SIZE': 4093}>
    [2021-03-19 09:34:59 +0000] [1] [INFO] Starting gunicorn 20.0.4
    [2021-03-19 09:34:59 +0000] [1] [INFO] Listening at: http://0.0.0.0:9000 (1)
    [2021-03-19 09:34:59 +0000] [1] [INFO] Using worker: threads
    [2021-03-19 09:34:59 +0000] [28] [INFO] Booting worker with pid: 28
    2021-03-19 09:34:59,451 - seldon_core.gunicorn_utils:load:88 - INFO: Tracing branch is active
    [2021-03-19 09:34:59 +0000] [22] [INFO] Starting gunicorn 20.0.4
    [2021-03-19 09:34:59 +0000] [22] [INFO] Listening at: http://0.0.0.0:6000 (22)
    [2021-03-19 09:34:59 +0000] [22] [INFO] Using worker: sync
    2021-03-19 09:34:59,458 - seldon_core.utils:setup_tracing:724 - INFO: Initializing tracing
    [2021-03-19 09:34:59 +0000] [30] [INFO] Booting worker with pid: 30
    2021-03-19 09:34:59.505769: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
    2021-03-19 09:34:59.505886: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
    2021-03-19 09:34:59.505961: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (test-deployment-example-50a2328-db496c484-fw2mz): /proc/driver/nvidia/version does not exist
    2021-03-19 09:34:59.509202: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    2021-03-19 09:34:59.522083: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2499995000 Hz
    2021-03-19 09:34:59.522499: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x557139c42470 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
    2021-03-19 09:34:59.522608: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
    2021-03-19 09:34:59,542 - seldon_core.utils:setup_tracing:731 - INFO: Using default tracing config
    2021-03-19 09:34:59,542 - jaeger_tracing:_create_local_agent_channel:446 - INFO: Initializing Jaeger Tracer with UDP reporter
    2021-03-19 09:34:59,545 - jaeger_tracing:new_tracer:384 - INFO: Using sampler ConstSampler(True)
    2021-03-19 09:34:59,550 - jaeger_tracing:_initialize_global_tracer:436 - INFO: opentracing.tracer initialized to <jaeger_client.tracer.Tracer object at 0x7fa808cc2a10>[app_name=Titanic]
    2021-03-19 09:34:59,550 - seldon_core.gunicorn_utils:load:93 - INFO: Set JAEGER_EXTRA_TAGS []
    2021-03-19 09:34:59.624040: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
    2021-03-19 09:34:59.624092: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
    2021-03-19 09:34:59.624132: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (test-deployment-example-50a2328-db496c484-fw2mz): /proc/driver/nvidia/version does not exist
    2021-03-19 09:34:59.624483: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    2021-03-19 09:34:59.651544: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2499995000 Hz
    2021-03-19 09:34:59.652034: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x557139c42470 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
    2021-03-19 09:34:59.652060: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
    2021-03-19 09:34:59,852 - seldon_core.microservice:grpc_prediction_server:452 - INFO: GRPC microservice Running on port 5000
  • seldon-container-engine
    2021-03-19 09:35:04.387 INFO 6 --- [ main] i.s.e.App : Starting App v0.5.0 on test-deployment-example-50a2328-db496c484-fw2mz with PID 6 (/app.jar started by ? in /)
    2021-03-19 09:35:04.400 INFO 6 --- [ main] i.s.e.App : No active profile set, falling back to default profiles: default
    2021-03-19 09:35:06.396 INFO 6 --- [ main] i.s.e.c.CustomizationBean : Customizing EmbeddedServlet
    2021-03-19 09:35:06.398 INFO 6 --- [ main] i.s.e.c.CustomizationBean : FOUND env var [ENGINE_SERVER_PORT], will use for engine server port
    2021-03-19 09:35:06.398 INFO 6 --- [ main] i.s.e.c.CustomizationBean : setting serverPort[8000]
    2021-03-19 09:35:07.333 WARN 6 --- [ main] o.s.h.c.j.Jackson2ObjectMapperBuilder : For Jackson Kotlin classes support please add "com.fasterxml.jackson.module:jackson-module-kotlin" to the classpath
    2021-03-19 09:35:07.513 INFO 6 --- [ main] i.s.e.p.EnginePredictor : init
    2021-03-19 09:35:07.514 INFO 6 --- [ main] i.s.e.p.EnginePredictor : FOUND env var [ENGINE_PREDICTOR], will use for engine predictor
    2021-03-19 09:35:08.040 INFO 6 --- [ main] i.s.e.p.EnginePredictor : Setting deployment name to test-deployment
    2021-03-19 09:35:08.066 INFO 6 --- [ main] i.s.e.p.EnginePredictor : Installed engine predictor: {"name":"example","graph":{"name":"classifier","children":[],"type":"MODEL","implementation":"UNKNOWN_IMPLEMENTATION","methods":[],"endpoint":{"service_host":"localhost","service_port":9000,"type":"GRPC"},"parameters":[],"modelUri":"","serviceAccountName":"","envSecretRefName":""},"componentSpecs":[{"metadata":{"name":"","generateName":"","namespace":"","selfLink":"","uid":"","resourceVersion":"","generation":0,"deletionGracePeriodSeconds":0,"labels":{},"annotations":{},"ownerReferences":[],"finalizers":[],"clusterName":""},"spec":{"volumes":[],"containers":[{"name":"classifier","image":"dimasheva1/seldon:latest","command":[],"args":[],"workingDir":"","ports":[{"name":"grpc","hostPort":0,"containerPort":9000,"protocol":"TCP","hostIP":""}],"env":[{"name":"PREDICTIVE_UNIT_SERVICE_PORT","value":"9000"},{"name":"PREDICTIVE_UNIT_ID","value":"classifier"},{"name":"PREDICTOR_ID","value":"example"},{"name":"SELDON_DEPLOYMENT_ID","value":"seldon-model"}],"resources":{"limits":{},"requests":{}},"volumeMounts":[{"name":"podinfo","readOnly":false,"mountPath":"/etc/podinfo","subPath":"","mountPropagation":""}],"livenessProbe":{"initialDelaySeconds":60,"timeoutSeconds":1,"periodSeconds":5,"successThreshold":1,"failureThreshold":3},"readinessProbe":{"initialDelaySeconds":20,"timeoutSeconds":1,"periodSeconds":5,"successThreshold":1,"failureThreshold":3},"lifecycle":{"preStop":{"exec":{"command":["/bin/sh","-c","/bin/sleep 10"]}}},"terminationMessagePath":"/dev/termination-log","imagePullPolicy":"IfNotPresent","stdin":false,"stdinOnce":false,"tty":false,"envFrom":[],"terminationMessagePolicy":"File","volumeDevices":[]}],"restartPolicy":"","terminationGracePeriodSeconds":0,"activeDeadlineSeconds":0,"dnsPolicy":"","nodeSelector":{},"serviceAccountName":"","serviceAccount":"","nodeName":"","hostNetwork":false,"hostPID":false,"hostIPC":false,"imagePullSecrets":[{"name":"regcred"}],"hostname":"","subdomain":"","schedulerName":"","initContainers":[],"automountServiceAccountToken":false,"tolerations":[],"hostAliases":[],"priorityClassName":"","priority":0,"shareProcessNamespace":false,"readinessGates":[],"runtimeClassName":"","enableServiceLinks":false}}],"replicas":1,"annotations":{},"engineResources":{"limits":{},"requests":{}},"labels":{"version":"example"},"svcOrchSpec":{"env":[]},"traffic":0,"explainer":{"type":"","modelUri":"","serviceAccountName":"","envSecretRefName":"","containerSpec":{"name":"","image":"","command":[],"args":[],"workingDir":"","ports":[],"env":[],"resources":{"limits":{},"requests":{}},"volumeMounts":[],"terminationMessagePath":"","imagePullPolicy":"","stdin":false,"stdinOnce":false,"tty":false,"envFrom":[],"terminationMessagePolicy":"","volumeDevices":[]}}}
    2021-03-19 09:35:08.075 INFO 6 --- [ main] i.s.e.c.AnnotationsConfig : Annotations {kubernetes.io/config.source=api, kubernetes.io/psp=eks.privileged, kubernetes.io/config.seen=2021-03-19T09:34:08.084412441Z, prometheus.io/path=prometheus, prometheus.io/port=8000, prometheus.io/scrape=true}
    2021-03-19 09:35:08.079 INFO 6 --- [ main] i.s.e.t.TracingProvider : Not activating tracing
    2021-03-19 09:35:08.080 INFO 6 --- [ main] i.s.e.s.InternalPredictionService : REST Connection timeout set to 200
    2021-03-19 09:35:08.081 INFO 6 --- [ main] i.s.e.s.InternalPredictionService : REST read timeout set to 5000
    2021-03-19 09:35:08.390 INFO 6 --- [ main] i.s.e.s.InternalPredictionService : gRPC max message size set to 4194304
    2021-03-19 09:35:08.391 INFO 6 --- [ main] i.s.e.s.InternalPredictionService : gRPC read timeout set to 5000
    2021-03-19 09:35:08.391 INFO 6 --- [ main] i.s.e.s.InternalPredictionService : REST retries set to 3
    2021-03-19 09:35:08.524 INFO 6 --- [ main] i.s.e.g.SeldonGrpcServer : FOUND env var [ENGINE_SERVER_GRPC_PORT], will use engine server port 5001
    2021-03-19 09:35:08.843 INFO 6 --- [ task-1] i.s.e.g.SeldonGrpcServer : Starting grpc server
    2021-03-19 09:35:09.111 INFO 6 --- [ task-1] i.s.e.g.SeldonGrpcServer : Server started, listening on 5001
    2021-03-19 09:35:09.780 INFO 6 --- [ main] i.s.e.App : Started App in 6.037 seconds (JVM running for 7.275)
@dimasheva1 dimasheva1 added the bug and triage labels Mar 19, 2021
@ukclivecox
Contributor

Can you clarify the YAML and the actual issue you are seeing in more detail? Are you sure this is not an Ambassador setup issue on AWS?

@ukclivecox ukclivecox removed the triage label Mar 25, 2021
@ukclivecox ukclivecox added this to Triage in Backlog via automation Mar 25, 2021
@ukclivecox ukclivecox moved this from Triage to Q&A in Backlog Mar 25, 2021
@dimasheva1
Author

dimasheva1 commented Mar 26, 2021

The YAML file is a very simple Seldon deployment which I copied from the Seldon docs. I just added the secret and my Docker image, and changed the endpoint type to GRPC.
Actual issue: my endpoints don't work, and the Swagger page doesn't load.
I don't know about the Ambassador setup. I read the Seldon docs and ran the commands in order. When I tried to replace Ambassador with Istio, the pods did not start.

@axsaucedo
Contributor

@dimasheva1 are you sending the requests through the load balancer? We test Ambassador in our integration tests and run production environments on EKS without issues, so this looks like a cluster configuration issue or a misunderstanding.
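
For example, a prediction request through the Ambassador load balancer would look roughly like this (a sketch using the LB hostname from the report; the payload shape and feature values are hypothetical placeholders):

    # Prediction request via the Ambassador ingress (sketch).
    import requests

    url = ("http://ae3f5ef3d71fc416485f0e9e83076db0-1449493934"
           ".us-west-2.elb.amazonaws.com"
           "/seldon/default/seldon-model/api/v1.0/predictions")
    resp = requests.post(url, json={"data": {"ndarray": [[3, 22.0, 1, 0, 7.25]]}}, timeout=10)
    print(resp.status_code, resp.text)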

@axsaucedo
Contributor

It may be easier to ask these questions on Slack, as you'll get quicker answers to user questions there. I'll close this and we can reopen if we confirm an issue on Slack.

@ukclivecox ukclivecox removed this from Q&A in Backlog Apr 8, 2021