Skip to content

Conversation

@sanderegg
Copy link
Member

@sanderegg sanderegg commented Jan 23, 2024

What do these changes do?

This PR basically changes the protocol used by the whole dask backend to enable TLS security (see https://distributed.dask.org/en/latest/tls.html) and also encrypts communication between all the dask parts (client, scheduler, workers). Notable exception to this rule is the dask-scheduler dashboard that is still in unprotected, but is also read-only so deemed as a lesser issue.

To this end the Makefile was adapted to ensure a self-signed certificate is created before the stack may be deployed locally (see make certificates in services/dask-sidecar). These certificates are git-ignored and are not deleted unless make clean is run. The certificates are passed as docker secrets inside the different dask-aware services (dask-sidecar, dask-scheduler, director-v2, clusters-keeper and computational-autoscaling).

Currently the certificates are the same for all the osparc-defined clusters (default cluster and external clusters that are created by clusters-keeper) for sanity.

details

  • improve pytest-simcore docker_stack error messages to become more dev-friendly
  • improved tests in clusters-keeper to find out when ENVs are missing to run the docker-compose file

Related issue/s

How to test

Dev Checklist

DevOps Checklist

osparc-ops-deployment-configuration MRs:

osparc-ops-environments PRs:

BREAKING CHANGES MRs:

details

  1. the dask-certificates are made available via docker secrets to the dask-aware services (i.e. dask-sidecar/scheduler, director-v2, autoscaling, clusters-keeper)
  2. general new ENVs:
DASK_TLS_CA_FILE=/home/scu/.dask/dask-crt.pem
DASK_TLS_KEY=/home/scu/.dask/dask-key.pem
DASK_TLS_CERT=/home/scu/.dask/dask-crt.pem
  1. director-v2 BREAKING CHANGES ENVs - ⚠️ devops
    Ensure this changed from tcp to tls protocol
COMPUTATIONAL_BACKEND_DEFAULT_CLUSTER_URL=tls://XXXXX_dask-scheduler:8786

This variable was added but contains a default because this is the only way to keep the .env simple on simcore side (it basically is made out of the dask-sidecar variables). Therefore, this would not need to be setup in ops-env

COMPUTATIONAL_BACKEND_DEFAULT_CLUSTER_AUTH='{"type":"tls","tls_ca_file":"${DASK_TLS_CERT}","tls_client_cert":"${DASK_TLS_CERT}","tls_client_key":"${DASK_TLS_KEY}"}'
  1. clusters-keeper ENVs changes - ⚠️ devops (only for AWS deployments)
    These are new ENVs, they allow connecting on-demand clusters with the certificates.
    In order for the deployed EC2s to use the certificates they need to download them out of AWS ParameterStore as the certificates exceed the 16KB limit when starting an instance.
PRIMARY_EC2_INSTANCES_ATTACHED_IAM_PROFILE=AWS_SSM_READONLY_PROFILE
PRIMARY_EC2_INSTANCES_SSM_TLS_DASK_CA=/AWS/SSM/TLS_DASK_CA
PRIMARY_EC2_INSTANCES_SSM_TLS_DASK_CERT=/AWS/SSM/TLS_DASK_CERT
PRIMARY_EC2_INSTANCES_SSM_TLS_DASK_KEY=/AWS/SSM/TLS_DASK_KEY

The following is also not necessary in ops-env for same reason as explained in 2.

CLUSTERS_KEEPER_COMPUTATIONAL_BACKEND_DEFAULT_CLUSTER_AUTH='{"type":"tls","tls_ca_file":"${DASK_TLS_CERT}","tls_client_cert":"${DASK_TLS_CERT}","tls_client_key":"${DASK_TLS_KEY}"}'
  1. the autoscaling now receives the following new ENVs (only for computational mode, i.e. not affecting ops env)
DASK_SCHEDULER_AUTH='{"type":"tls","tls_ca_file":"/home/scu/.dask/dask-crt.pem","tls_client_cert":"/home/scu/.dask/dask-crt.pem","tls_client_key":"/home/scu/.dask/dask-key.pem"}'

@sanderegg sanderegg added this to the This is Sparta! milestone Jan 23, 2024
@sanderegg sanderegg self-assigned this Jan 23, 2024
@codecov
Copy link

codecov bot commented Jan 23, 2024

Codecov Report

Attention: 10 lines in your changes are missing coverage. Please review.

Comparison is base (1797692) 85.3% compared to head (c864a1f) 87.1%.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff            @@
##           master   #5265     +/-   ##
========================================
+ Coverage    85.3%   87.1%   +1.7%     
========================================
  Files        1316    1316             
  Lines       53753   53813     +60     
  Branches     1170    1170             
========================================
+ Hits        45904   46912   +1008     
+ Misses       7600    6652    -948     
  Partials      249     249             
Flag Coverage Δ
integrationtests 63.7% <88.8%> (-1.4%) ⬇️
unittests 85.1% <81.5%> (+2.0%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
packages/aws-library/src/aws_library/ec2/client.py 100.0% <ø> (ø)
packages/aws-library/src/aws_library/ec2/models.py 88.6% <100.0%> (+0.1%) ⬆️
...ages/models-library/src/models_library/clusters.py 100.0% <100.0%> (ø)
...g/src/simcore_service_autoscaling/core/settings.py 97.9% <100.0%> (+<0.1%) ⬆️
...e_service_autoscaling/modules/auto_scaling_core.py 94.2% <ø> (ø)
...scaling/modules/auto_scaling_mode_computational.py 91.0% <100.0%> (+0.4%) ⬆️
...c/simcore_service_clusters_keeper/core/settings.py 96.0% <100.0%> (+0.2%) ⬆️
...imcore_service_clusters_keeper/modules/clusters.py 100.0% <ø> (ø)
...lusters_keeper/modules/clusters_management_core.py 100.0% <100.0%> (ø)
.../src/simcore_service_clusters_keeper/utils/dask.py 100.0% <100.0%> (ø)
... and 7 more

... and 71 files with indirect coverage changes

@sanderegg sanderegg force-pushed the enable-tls-security branch 4 times, most recently from fc88c81 to f6d2ca2 Compare January 25, 2024 15:09
@sanderegg sanderegg changed the title Secure Dask backend using self-signed certificates Secure Dask backend using self-signed certificates (⚠️ devops) Jan 25, 2024
@sanderegg sanderegg added a:infra+ops maintenance of infrastructure or operations (discussed in retro) a:director-v2 issue related with the director-v2 service a:dask-service Any of the dask services: dask-scheduler/sidecar or worker a:autoscaling autoscaling service in simcore's stack a:clusters-keeper labels Jan 26, 2024
@sanderegg sanderegg force-pushed the enable-tls-security branch 4 times, most recently from c7ad51f to 9c0bfce Compare January 29, 2024 11:22
@sanderegg sanderegg marked this pull request as ready for review January 29, 2024 14:52
@sanderegg sanderegg force-pushed the enable-tls-security branch from 32747dd to b20c15a Compare January 29, 2024 14:52
Copy link
Member

@pcrespov pcrespov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool!
As a general remark, I wonder whether all the certificates tooling (create, delete, list, ...) could be generalized for any service and not only the dask services.
If it is not such a hustle, it might become handy in the future for some other service.

@sanderegg sanderegg force-pushed the enable-tls-security branch from c458849 to b0798f5 Compare January 30, 2024 07:14
@sanderegg sanderegg force-pushed the enable-tls-security branch from b0798f5 to c864a1f Compare January 30, 2024 09:32
@sonarqubecloud
Copy link

Quality Gate Passed Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

Copy link
Member

@mrnicegyu11 mrnicegyu11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack for the devops changes. Looks scarry but I saw no issues conceptually

@sanderegg sanderegg merged commit 94ab546 into ITISFoundation:master Jan 30, 2024
@sanderegg sanderegg deleted the enable-tls-security branch January 30, 2024 13:17
@matusdrobuliak66 matusdrobuliak66 mentioned this pull request Feb 14, 2024
39 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

a:autoscaling autoscaling service in simcore's stack a:clusters-keeper a:dask-service Any of the dask services: dask-scheduler/sidecar or worker a:director-v2 issue related with the director-v2 service a:infra+ops maintenance of infrastructure or operations (discussed in retro)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Secure connection to external computational clusters

5 participants