Deployment issue on AWS #3077

Closed
dimasheva1 opened this issue Mar 19, 2021 · 4 comments

Comments

@dimasheva1

Describe the bug

I get a Whitelabel Error Page when I open the Ambassador endpoint of my model on AWS (the Istio endpoint doesn't work because Seldon Core cannot be deployed with it). The endpoint http://localhost:9000/api/v1.0/predictions works fine for prediction locally via a POST request.
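
For reference, this is roughly the local request that works (a minimal sketch assuming the standard Seldon Core REST payload; the feature vector is a hypothetical placeholder, not the real Titanic input schema):

    # Local prediction request against the wrapped model (sketch).
    # Assumes the model server from the Dockerfile below is running on port 9000.
    import requests

    payload = {"data": {"ndarray": [[3, 22.0, 1, 0, 7.25]]}}  # hypothetical features
    resp = requests.post(
        "http://localhost:9000/api/v1.0/predictions",
        json=payload,
        timeout=10,
    )
    print(resp.status_code, resp.json())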

To reproduce

  1. Create the Python wrapper (a minimal sketch follows this list).
  2. Build the Docker image (FROM seldonio/seldon-core-s2i-python3:1.7.0-dev; same behaviour with python:3.8-slim).
  3. Deploy on AWS with Ambassador ingress:
  • kubectl create namespace seldon-system
  • helm install seldon-core seldon-core-aws --repo https://storage.googleapis.com/seldon-aws-charts --version 0.5.0 --set usageMetrics.enabled=true --namespace seldon-system --set ambassador.enabled=true
  • kubectl create namespace ambassador || echo "namespace ambassador exists"
  • helm install ambassador datawire/ambassador
    --set image.repository=quay.io/datawire/ambassador
    --set enableAES=false
    --set crds.keep=false
    --namespace ambassador
  • Create the docker-registry secret regcred
  • kubectl create -f seldon-deploy.yaml
  4. Open the Ambassador endpoint: http://ae3f5ef3d71fc416485f0e9e83076db0-1449493934.us-west-2.elb.amazonaws.com/seldon/default/seldon-model/api/v1.0/doc
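
For reference, step 1 refers to a Seldon Python wrapper. A minimal sketch is below; the class name must match MODEL_NAME in the Dockerfile (Titanic), while the model loading and predict logic shown here are assumptions for illustration only:

    # Titanic.py — minimal Seldon Python wrapper sketch (hypothetical body).
    import numpy as np

    class Titanic:
        def __init__(self):
            # Hypothetical: load the trained model artifact here.
            self.model = None

        def predict(self, X, features_names=None):
            # Hypothetical pass-through; the real wrapper would call the
            # loaded TensorFlow model, e.g. self.model.predict(X).
            return np.asarray(X)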

Expected behaviour

See the standardized user interface (the Swagger/OpenAPI documentation page) for the model.

Environment

  • Cloud Provider: AWS
  • Kubernetes Cluster Version: 1.18
  • Deployed Seldon System Images:
    value: 403495124976.dkr.ecr.us-east-1.amazonaws.com/cc92bd08-3aee-4006-983a-b74fbf1cbfa8/cg-2585947346/seldonio/engine:0.5.0-latest
    image: 403495124976.dkr.ecr.us-east-1.amazonaws.com/cc92bd08-3aee-4006-983a-b74fbf1cbfa8/cg-2585947346/seldonio/seldon-core-operator:0.5.0-latest

Model Details

  • Images of your model:
  • Dockerfile:
    FROM seldonio/seldon-core-s2i-python3:1.7.0-dev
    COPY . /app
    WORKDIR /app
    RUN pip install -r requirements.txt
    EXPOSE 5000
    EXPOSE 6000
    EXPOSE 9000
    ENV MODEL_NAME Titanic
    ENV SERVICE_TYPE MODEL
    ENV PERSISTENCE 0
    CMD exec seldon-core-microservice $MODEL_NAME --service-type $SERVICE_TYPE --persistence $PERSISTENCE
  • Seldon deployment yaml:
    apiVersion: machinelearning.seldon.io/v1alpha2
    kind: SeldonDeployment
    metadata:
      name: seldon-model
    spec:
      name: test-deployment
      predictors:
      - componentSpecs:
        - spec:
            containers:
            - name: classifier
              image: dimasheva1/seldon:latest
            imagePullSecrets:
            - name: regcred
        graph:
          children: []
          endpoint:
            type: GRPC
          name: classifier
          type: MODEL
        name: example
        replicas: 1
  • Logs of your model:
  • classifier:
    2021-03-19 09:34:56.922094: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
    2021-03-19 09:34:56.922143: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
    2021-03-19 09:34:59,344 - seldon_core.microservice:main:203 - INFO: Starting microservice.py:main
    2021-03-19 09:34:59,344 - seldon_core.microservice:main:204 - INFO: Seldon Core version: 1.7.0-dev
    2021-03-19 09:34:59,347 - seldon_core.microservice:main:332 - INFO: Parse JAEGER_EXTRA_TAGS []
    2021-03-19 09:34:59,347 - seldon_core.microservice:load_annotations:155 - INFO: Found annotation kubernetes.io/config.seen:2021-03-19T09:34:08.084412441Z
    2021-03-19 09:34:59,347 - seldon_core.microservice:load_annotations:155 - INFO: Found annotation kubernetes.io/config.source:api
    2021-03-19 09:34:59,347 - seldon_core.microservice:load_annotations:155 - INFO: Found annotation kubernetes.io/psp:eks.privileged
    2021-03-19 09:34:59,347 - seldon_core.microservice:load_annotations:155 - INFO: Found annotation prometheus.io/path:prometheus
    2021-03-19 09:34:59,347 - seldon_core.microservice:load_annotations:155 - INFO: Found annotation prometheus.io/port:8000
    2021-03-19 09:34:59,347 - seldon_core.microservice:load_annotations:155 - INFO: Found annotation prometheus.io/scrape:true
    2021-03-19 09:34:59,347 - seldon_core.microservice:main:335 - INFO: Annotations: {'kubernetes.io/config.seen': '2021-03-19T09:34:08.084412441Z', 'kubernetes.io/config.source': 'api', 'kubernetes.io/psp': 'eks.privileged', 'prometheus.io/path': 'prometheus', 'prometheus.io/port': '8000', 'prometheus.io/scrape': 'true'}
    2021-03-19 09:34:59,347 - seldon_core.microservice:main:339 - INFO: Importing Titanic
    2021-03-19 09:34:59,374 - seldon_core.microservice:main:422 - INFO: REST gunicorn microservice running on port 9000
    2021-03-19 09:34:59,376 - seldon_core.microservice:main:476 - INFO: REST metrics microservice running on port 6000
    2021-03-19 09:34:59,376 - seldon_core.microservice:main:486 - INFO: Starting servers
    2021-03-19 09:34:59,392 - seldon_core.wrapper:_set_flask_app_configs:213 - INFO: App Config: <Config {'ENV': 'production', 'DEBUG': False, 'TESTING': False, 'PROPAGATE_EXCEPTIONS': None, 'PRESERVE_CONTEXT_ON_EXCEPTION': None, 'SECRET_KEY': None, 'PERMANENT_SESSION_LIFETIME': datetime.timedelta(days=31), 'USE_X_SENDFILE': False, 'SERVER_NAME': None, 'APPLICATION_ROOT': '/', 'SESSION_COOKIE_NAME': 'session', 'SESSION_COOKIE_DOMAIN': None, 'SESSION_COOKIE_PATH': None, 'SESSION_COOKIE_HTTPONLY': True, 'SESSION_COOKIE_SECURE': False, 'SESSION_COOKIE_SAMESITE': None, 'SESSION_REFRESH_EACH_REQUEST': True, 'MAX_CONTENT_LENGTH': None, 'SEND_FILE_MAX_AGE_DEFAULT': datetime.timedelta(seconds=43200), 'TRAP_BAD_REQUEST_ERRORS': None, 'TRAP_HTTP_EXCEPTIONS': False, 'EXPLAIN_TEMPLATE_LOADING': False, 'PREFERRED_URL_SCHEME': 'http', 'JSON_AS_ASCII': True, 'JSON_SORT_KEYS': True, 'JSONIFY_PRETTYPRINT_REGULAR': False, 'JSONIFY_MIMETYPE': 'application/json', 'TEMPLATES_AUTO_RELOAD': None, 'MAX_COOKIE_SIZE': 4093}>
    2021-03-19 09:34:59,411 - seldon_core.wrapper:_set_flask_app_configs:213 - INFO: App Config: <Config {'ENV': 'production', 'DEBUG': False, 'TESTING': False, 'PROPAGATE_EXCEPTIONS': None, 'PRESERVE_CONTEXT_ON_EXCEPTION': None, 'SECRET_KEY': None, 'PERMANENT_SESSION_LIFETIME': datetime.timedelta(days=31), 'USE_X_SENDFILE': False, 'SERVER_NAME': None, 'APPLICATION_ROOT': '/', 'SESSION_COOKIE_NAME': 'session', 'SESSION_COOKIE_DOMAIN': None, 'SESSION_COOKIE_PATH': None, 'SESSION_COOKIE_HTTPONLY': True, 'SESSION_COOKIE_SECURE': False, 'SESSION_COOKIE_SAMESITE': None, 'SESSION_REFRESH_EACH_REQUEST': True, 'MAX_CONTENT_LENGTH': None, 'SEND_FILE_MAX_AGE_DEFAULT': datetime.timedelta(seconds=43200), 'TRAP_BAD_REQUEST_ERRORS': None, 'TRAP_HTTP_EXCEPTIONS': False, 'EXPLAIN_TEMPLATE_LOADING': False, 'PREFERRED_URL_SCHEME': 'http', 'JSON_AS_ASCII': True, 'JSON_SORT_KEYS': True, 'JSONIFY_PRETTYPRINT_REGULAR': False, 'JSONIFY_MIMETYPE': 'application/json', 'TEMPLATES_AUTO_RELOAD': None, 'MAX_COOKIE_SIZE': 4093}>
    [2021-03-19 09:34:59 +0000] [1] [INFO] Starting gunicorn 20.0.4
    [2021-03-19 09:34:59 +0000] [1] [INFO] Listening at: http://0.0.0.0:9000 (1)
    [2021-03-19 09:34:59 +0000] [1] [INFO] Using worker: threads
    [2021-03-19 09:34:59 +0000] [28] [INFO] Booting worker with pid: 28
    2021-03-19 09:34:59,451 - seldon_core.gunicorn_utils:load:88 - INFO: Tracing branch is active
    [2021-03-19 09:34:59 +0000] [22] [INFO] Starting gunicorn 20.0.4
    [2021-03-19 09:34:59 +0000] [22] [INFO] Listening at: http://0.0.0.0:6000 (22)
    [2021-03-19 09:34:59 +0000] [22] [INFO] Using worker: sync
    2021-03-19 09:34:59,458 - seldon_core.utils:setup_tracing:724 - INFO: Initializing tracing
    [2021-03-19 09:34:59 +0000] [30] [INFO] Booting worker with pid: 30
    2021-03-19 09:34:59.505769: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
    2021-03-19 09:34:59.505886: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
    2021-03-19 09:34:59.505961: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (test-deployment-example-50a2328-db496c484-fw2mz): /proc/driver/nvidia/version does not exist
    2021-03-19 09:34:59.509202: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    2021-03-19 09:34:59.522083: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2499995000 Hz
    2021-03-19 09:34:59.522499: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x557139c42470 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
    2021-03-19 09:34:59.522608: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
    2021-03-19 09:34:59,542 - seldon_core.utils:setup_tracing:731 - INFO: Using default tracing config
    2021-03-19 09:34:59,542 - jaeger_tracing:_create_local_agent_channel:446 - INFO: Initializing Jaeger Tracer with UDP reporter
    2021-03-19 09:34:59,545 - jaeger_tracing:new_tracer:384 - INFO: Using sampler ConstSampler(True)
    2021-03-19 09:34:59,550 - jaeger_tracing:_initialize_global_tracer:436 - INFO: opentracing.tracer initialized to <jaeger_client.tracer.Tracer object at 0x7fa808cc2a10>[app_name=Titanic]
    2021-03-19 09:34:59,550 - seldon_core.gunicorn_utils:load:93 - INFO: Set JAEGER_EXTRA_TAGS []
    2021-03-19 09:34:59.624040: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
    2021-03-19 09:34:59.624092: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
    2021-03-19 09:34:59.624132: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (test-deployment-example-50a2328-db496c484-fw2mz): /proc/driver/nvidia/version does not exist
    2021-03-19 09:34:59.624483: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    2021-03-19 09:34:59.651544: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2499995000 Hz
    2021-03-19 09:34:59.652034: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x557139c42470 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
    2021-03-19 09:34:59.652060: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
    2021-03-19 09:34:59,852 - seldon_core.microservice:grpc_prediction_server:452 - INFO: GRPC microservice Running on port 5000
  • seldon-container-engine
    2021-03-19 09:35:04.387 INFO 6 --- [ main] i.s.e.App : Starting App v0.5.0 on test-deployment-example-50a2328-db496c484-fw2mz with PID 6 (/app.jar started by ? in /)
    2021-03-19 09:35:04.400 INFO 6 --- [ main] i.s.e.App : No active profile set, falling back to default profiles: default
    2021-03-19 09:35:06.396 INFO 6 --- [ main] i.s.e.c.CustomizationBean : Customizing EmbeddedServlet
    2021-03-19 09:35:06.398 INFO 6 --- [ main] i.s.e.c.CustomizationBean : FOUND env var [ENGINE_SERVER_PORT], will use for engine server port
    2021-03-19 09:35:06.398 INFO 6 --- [ main] i.s.e.c.CustomizationBean : setting serverPort[8000]
    2021-03-19 09:35:07.333 WARN 6 --- [ main] o.s.h.c.j.Jackson2ObjectMapperBuilder : For Jackson Kotlin classes support please add "com.fasterxml.jackson.module:jackson-module-kotlin" to the classpath
    2021-03-19 09:35:07.513 INFO 6 --- [ main] i.s.e.p.EnginePredictor : init
    2021-03-19 09:35:07.514 INFO 6 --- [ main] i.s.e.p.EnginePredictor : FOUND env var [ENGINE_PREDICTOR], will use for engine predictor
    2021-03-19 09:35:08.040 INFO 6 --- [ main] i.s.e.p.EnginePredictor : Setting deployment name to test-deployment
    2021-03-19 09:35:08.066 INFO 6 --- [ main] i.s.e.p.EnginePredictor : Installed engine predictor: {"name":"example","graph":{"name":"classifier","children":[],"type":"MODEL","implementation":"UNKNOWN_IMPLEMENTATION","methods":[],"endpoint":{"service_host":"localhost","service_port":9000,"type":"GRPC"},"parameters":[],"modelUri":"","serviceAccountName":"","envSecretRefName":""},"componentSpecs":[{"metadata":{"name":"","generateName":"","namespace":"","selfLink":"","uid":"","resourceVersion":"","generation":0,"deletionGracePeriodSeconds":0,"labels":{},"annotations":{},"ownerReferences":[],"finalizers":[],"clusterName":""},"spec":{"volumes":[],"containers":[{"name":"classifier","image":"dimasheva1/seldon:latest","command":[],"args":[],"workingDir":"","ports":[{"name":"grpc","hostPort":0,"containerPort":9000,"protocol":"TCP","hostIP":""}],"env":[{"name":"PREDICTIVE_UNIT_SERVICE_PORT","value":"9000"},{"name":"PREDICTIVE_UNIT_ID","value":"classifier"},{"name":"PREDICTOR_ID","value":"example"},{"name":"SELDON_DEPLOYMENT_ID","value":"seldon-model"}],"resources":{"limits":{},"requests":{}},"volumeMounts":[{"name":"podinfo","readOnly":false,"mountPath":"/etc/podinfo","subPath":"","mountPropagation":""}],"livenessProbe":{"initialDelaySeconds":60,"timeoutSeconds":1,"periodSeconds":5,"successThreshold":1,"failureThreshold":3},"readinessProbe":{"initialDelaySeconds":20,"timeoutSeconds":1,"periodSeconds":5,"successThreshold":1,"failureThreshold":3},"lifecycle":{"preStop":{"exec":{"command":["/bin/sh","-c","/bin/sleep 10"]}}},"terminationMessagePath":"/dev/termination-log","imagePullPolicy":"IfNotPresent","stdin":false,"stdinOnce":false,"tty":false,"envFrom":[],"terminationMessagePolicy":"File","volumeDevices":[]}],"restartPolicy":"","terminationGracePeriodSeconds":0,"activeDeadlineSeconds":0,"dnsPolicy":"","nodeSelector":{},"serviceAccountName":"","serviceAccount":"","nodeName":"","hostNetwork":false,"hostPID":false,"hostIPC":false,"imagePullSecrets":[{"name":"regcred"}],"hostname":"","subdomain":"","schedulerName":"","initContainers":[],"automountServiceAccountToken":false,"tolerations":[],"hostAliases":[],"priorityClassName":"","priority":0,"shareProcessNamespace":false,"readinessGates":[],"runtimeClassName":"","enableServiceLinks":false}}],"replicas":1,"annotations":{},"engineResources":{"limits":{},"requests":{}},"labels":{"version":"example"},"svcOrchSpec":{"env":[]},"traffic":0,"explainer":{"type":"","modelUri":"","serviceAccountName":"","envSecretRefName":"","containerSpec":{"name":"","image":"","command":[],"args":[],"workingDir":"","ports":[],"env":[],"resources":{"limits":{},"requests":{}},"volumeMounts":[],"terminationMessagePath":"","imagePullPolicy":"","stdin":false,"stdinOnce":false,"tty":false,"envFrom":[],"terminationMessagePolicy":"","volumeDevices":[]}}}
    2021-03-19 09:35:08.075 INFO 6 --- [ main] i.s.e.c.AnnotationsConfig : Annotations {kubernetes.io/config.source=api, kubernetes.io/psp=eks.privileged, kubernetes.io/config.seen=2021-03-19T09:34:08.084412441Z, prometheus.io/path=prometheus, prometheus.io/port=8000, prometheus.io/scrape=true}
    2021-03-19 09:35:08.079 INFO 6 --- [ main] i.s.e.t.TracingProvider : Not activating tracing
    2021-03-19 09:35:08.080 INFO 6 --- [ main] i.s.e.s.InternalPredictionService : REST Connection timeout set to 200
    2021-03-19 09:35:08.081 INFO 6 --- [ main] i.s.e.s.InternalPredictionService : REST read timeout set to 5000
    2021-03-19 09:35:08.390 INFO 6 --- [ main] i.s.e.s.InternalPredictionService : gRPC max message size set to 4194304
    2021-03-19 09:35:08.391 INFO 6 --- [ main] i.s.e.s.InternalPredictionService : gRPC read timeout set to 5000
    2021-03-19 09:35:08.391 INFO 6 --- [ main] i.s.e.s.InternalPredictionService : REST retries set to 3
    2021-03-19 09:35:08.524 INFO 6 --- [ main] i.s.e.g.SeldonGrpcServer : FOUND env var [ENGINE_SERVER_GRPC_PORT], will use engine server port 5001
    2021-03-19 09:35:08.843 INFO 6 --- [ task-1] i.s.e.g.SeldonGrpcServer : Starting grpc server
    2021-03-19 09:35:09.111 INFO 6 --- [ task-1] i.s.e.g.SeldonGrpcServer : Server started, listening on 5001
    2021-03-19 09:35:09.780 INFO 6 --- [ main] i.s.e.App : Started App in 6.037 seconds (JVM running for 7.275)
@dimasheva1 dimasheva1 added the bug and triage labels Mar 19, 2021
@ukclivecox
Contributor

Can you clarify the YAML and the actual issue you are seeing in more detail? Are you sure this is not an Ambassador setup issue on AWS?

@ukclivecox ukclivecox removed the triage label Mar 25, 2021
@ukclivecox ukclivecox added this to Triage in Backlog via automation Mar 25, 2021
@ukclivecox ukclivecox moved this from Triage to Q&A in Backlog Mar 25, 2021
@dimasheva1
Author

dimasheva1 commented Mar 26, 2021

The YAML file is a very simple Seldon deployment which I copied from the Seldon docs. I just added the secret and my Docker image, and changed the endpoint type to GRPC.
Actual issue: my endpoints don't work, and the Swagger page doesn't load.
I don't know about the Ambassador setup. I read the Seldon docs and ran the commands in order. When I tried to replace Ambassador with Istio, the pods did not start.

@axsaucedo
Contributor

@dimasheva1 are you sending the requests through the load balancer? We test Ambassador in our integration tests and run production environments on EKS without issues, so this looks like a cluster configuration issue or a misunderstanding.
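
For example, a prediction request through the Ambassador load balancer would look roughly like this (a sketch using the LB hostname from the report; the payload shape and feature values are hypothetical placeholders):

    # Prediction request via the Ambassador ingress (sketch).
    import requests

    url = ("http://ae3f5ef3d71fc416485f0e9e83076db0-1449493934"
           ".us-west-2.elb.amazonaws.com"
           "/seldon/default/seldon-model/api/v1.0/predictions")
    resp = requests.post(url, json={"data": {"ndarray": [[3, 22.0, 1, 0, 7.25]]}}, timeout=10)
    print(resp.status_code, resp.text)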

@axsaucedo
Contributor

It may be easier to ask these questions on Slack, as you'll get quicker answers to user questions there. I'll close this and we can reopen if we confirm an issue on Slack.

@ukclivecox ukclivecox removed this from Q&A in Backlog Apr 8, 2021