Added example of Versioned PiWind deployments #22

Merged
sambles merged 5 commits into develop from update/piwind-example-versioned
Jan 10, 2024

Conversation

sambles
Contributor

@sambles sambles commented Dec 13, 2023

Updated deployment for PR OasisLMF/OasisPlatform#931

This adds two PiWind deployments, one for each API version (v1 and v2). There is currently an issue with auto-scaling between these two deployments. For testing, disable the worker-controller and manually scale the PiWind deployments.

$ kubectl scale --replicas=0 deployment/oasis-worker-controller
deployment.apps/oasis-worker-controller scaled
$ kubectl scale --replicas=1 deployment/worker-oasislmf-piwind-1-v1
deployment.apps/worker-oasislmf-piwind-1-v1 scaled
$ kubectl scale --replicas=1 deployment/worker-oasislmf-piwind-1-v2
deployment.apps/worker-oasislmf-piwind-1-v2 scaled
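
For repeated test runs, the three scale commands above can be wrapped in a small script. The following is only a sketch: the deployment names are copied from the commands above, while the function names `scale_commands` and `apply` and the dry-run default are illustrative assumptions, not part of this repository.

```python
import subprocess

# Deployment names taken from the kubectl commands above.
CONTROLLER = "oasis-worker-controller"
PIWIND_WORKERS = ("worker-oasislmf-piwind-1-v1", "worker-oasislmf-piwind-1-v2")


def scale_commands(worker_replicas=1):
    """Return kubectl argv lists: controller to 0 first, then each PiWind worker."""
    targets = [(CONTROLLER, 0)] + [(name, worker_replicas) for name in PIWIND_WORKERS]
    return [
        ["kubectl", "scale", f"--replicas={count}", f"deployment/{name}"]
        for name, count in targets
    ]


def apply(commands, dry_run=True):
    """Print the commands (dry run) or execute them, stopping on the first failure."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run by default; pass dry_run=False to call kubectl for real.
    apply(scale_commands())
```

The controller is scaled down first so that it cannot undo the manual worker scaling while the remaining commands run.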

Screenshot from 2023-12-14 11-57-18

To change the v1 worker image, switch the version tag in these lines (for example, dev -> 1.28.5) and rerun ./deploy.sh models:

oasislmf_piwind_1:                          # A name that is unique among all workers
  supplierId: OasisLMF                      # Must be identical to supplier in the model data file share
  modelId: PiWind                           # Must be identical to name in the model data file share
  modelVersionId: "1"                       # Must be identical to version in the model data file share
  apiVersion: "v1"                          # Single Server execution
  image: ${ACR}/coreoasis/model_worker      # The path to your image, ${ACR} will automatically be replaced with your environments URL
  version: dev                              # Version tag of your image

Troubleshooting - Celery DB connection error

The v1 workers may crash with the following error:

$ kubectl logs worker-oasislmf-piwind-1-v1-56977d8c49-q8t6m
Defaulted container "worker" out of: worker, init-tcp-wait-by-secret (init)
wait-for-it.sh: waiting 60 seconds for broker:5672
wait-for-it.sh: broker:5672 is available after 0 seconds
wait-for-it.sh: waiting 60 seconds for oasis-j25vz4tnmuh5m.privatelink.postgres.database.azure.com:5432
wait-for-it.sh: oasis-j25vz4tnmuh5m.privatelink.postgres.database.azure.com:5432 is available after 0 seconds
[2023-12-14 13:35:46,296: CRITICAL/MainProcess] Unrecoverable error: ValueError("Port could not be cast to integer value as '******' ")

The error is caused by special characters in the Celery password, which trip up the Celery DB connection. The v2 workers do not have this issue because the special characters are escaped within the Celery app.
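
The failure mode can be reproduced in isolation. The sketch below assumes the password is embedded in a URL-style connection string and parsed with Python's `urllib.parse`, whose `port` property raises a `ValueError` with the same wording as the log above; the exact code path inside the v1 worker is not shown in this PR. The password, host name, and helper names (`build_db_url`, `port_error`) are hypothetical.

```python
from urllib.parse import quote, urlsplit


def build_db_url(user, password, host, port, database):
    """Build a connection URL with the user and password percent-encoded."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


def port_error(url):
    """Return the ValueError message raised when reading the port, or None."""
    try:
        urlsplit(url).port
    except ValueError as exc:
        return str(exc)
    return None


# Hypothetical password containing characters that are URL delimiters.
PASSWORD = "pa/ss#word"

# Unescaped: the '/' ends the network location early, so the text after
# "celery:" is treated as the port and cannot be converted to an integer.
raw_url = f"postgresql://celery:{PASSWORD}@db.example:5432/celery"

# Escaped: the delimiters are percent-encoded, so the URL parses as intended.
safe_url = build_db_url("celery", PASSWORD, "db.example", 5432, "celery")

if __name__ == "__main__":
    print("raw: ", port_error(raw_url))
    print("safe:", port_error(safe_url), urlsplit(safe_url).port)
```

Percent-encoding the credentials before building the URL is the general form of the escaping described above for v2.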

The easiest workaround is to reset the Celery password by deleting the celery-db-password entry from the Key Vault.
Screenshot from 2024-01-04 10-49-22

Then rerun the deployment with ./deploy.sh base using this branch, which generates a new password without the problematic characters.
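
As an illustration of what a password "without the problematic characters" can look like, the sketch below draws only from ASCII letters and digits, none of which need escaping in a connection URL. This is not the logic used by ./deploy.sh, which is not shown in this PR; the function name `generate_password` and the default length are assumptions.

```python
import secrets
import string

# Letters and digits only: none of these are URL delimiters.
SAFE_ALPHABET = string.ascii_letters + string.digits


def generate_password(length=32):
    """Return a random password drawn only from URL-safe alphanumerics."""
    return "".join(secrets.choice(SAFE_ALPHABET) for _ in range(length))


if __name__ == "__main__":
    print(generate_password())
```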

@sambles sambles merged commit ff64f52 into develop Jan 10, 2024
@sambles sambles deleted the update/piwind-example-versioned branch January 10, 2024 13:59