
Fix tolerations on gateway dask worker pods #567

Merged
merged 3 commits on Mar 23, 2020

Conversation

scottyhq
Member

fixes dask/dask-gateway#220

gateway-dask-worker pods get the same tolerations as the dask-kubernetes defaults: https://github.com/dask/dask-kubernetes/blob/b88ebb1f596ffd7b91299191e51fcd7b1df98a29/dask_kubernetes/objects.py#L215

puts scheduler pods on user-notebook nodes (although we might want to add a new nodegroup?)
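
For reference, a minimal sketch of the kind of values change this implies (the exact nesting depends on the dask-gateway chart version; the worker keys mirror the dask-kubernetes defaults linked above, and the scheduler toleration key is illustrative):

    worker:
      extraPodConfig:
        tolerations:
          # Same keys dask-kubernetes applies to its worker pods by default; the
          # underscore variant exists for platforms that reject "/" in taint keys.
          - key: "k8s.dask.org/dedicated"
            operator: "Equal"
            value: "worker"
            effect: "NoSchedule"
          - key: "k8s.dask.org_dedicated"
            operator: "Equal"
            value: "worker"
            effect: "NoSchedule"
    scheduler:
      extraPodConfig:
        tolerations:
          # Illustrative: tolerate the user-notebook taint so schedulers can land
          # on the user nodegroup (the actual key depends on the deployment).
          - key: "hub.jupyter.org/dedicated"
            operator: "Equal"
            value: "user"
            effect: "NoSchedule"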

@jhamman @TomAugspurger @tjcrone

@scottyhq
Member Author

Sorry, I overlooked #536, but the suggestions here are slightly different. For what it's worth, I think we should put schedulers in their own nodegroup, separate from users (just notebooks) and core (just the jupyterhub and dask pieces that are always running).

@scottyhq
Member Author

... and we might want to add some commits before merging to wrap up #496 (comment)

@scottyhq
Member Author

Noting that dask pods will currently still happily jump onto core nodes if room is available. This has come up before, with the suggestion of also adding taints to the core nodes (currently they don't have any): pangeo-data/pangeo-stacks#59
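
As a rough illustration (the label selector and taint key here are hypothetical, and the core pods themselves would then need matching tolerations), tainting the core nodes could look something like:

    # Taint every node carrying an (illustrative) core label so that dask pods,
    # which don't tolerate this taint, can no longer be scheduled onto them.
    kubectl taint nodes -l hub.jupyter.org/node-purpose=core \
      hub.jupyter.org/dedicated=core:NoSchedule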

@TomAugspurger
Member

Where are you hoping that the dask scheduler pods end up? In #536 we're ensuring they end up in the core pool.

So then the workers are on spot / preemptible nodes and the schedulers are on regular nodes.

@scottyhq
Member Author

scottyhq commented Mar 19, 2020

@TomAugspurger AWS Spot and GCE preemptible instances are a bit different (no 24-hour limit, as far as I understand). We've actually been running all nodes on Spot for a number of weeks now (even the core nodes). Typically these run for days and every now and again get rebooted. I guess I'm not too worried about the occasional couple-minute interruption; we're not really running any mission-critical workflows...

Just to clarify, we're also installing https://github.com/aws/aws-node-termination-handler so that if a core node is interrupted we have two minutes to automatically launch a new node and move pods to it. We haven't been operating this way for very long, but so far so good!
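
For anyone replicating this setup, a rough sketch of the install (chart and repo as published in aws/eks-charts; the namespace is just a sensible default):

    # Install the termination handler as a DaemonSet: it watches the EC2 instance
    # metadata for spot interruption notices and cordons/drains the node during
    # the two-minute warning window.
    helm repo add eks https://aws.github.io/eks-charts
    helm upgrade --install aws-node-termination-handler \
      eks/aws-node-termination-handler --namespace kube-system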

Where are you hoping that the dask scheduler pods end up?

User nodes seem better than core. Or a separate nodegroup.

@TomAugspurger
Member

Good to know.

I don't have a strong preference about which node pool the schedulers end up on. My slight preference is keeping them on regular (non-spot) nodes, since other groups are likely to copy our configuration and I wouldn't call running the scheduler on a spot instance a best practice (at least for mission-critical things; the cost-benefit analysis will differ from group to group).

@TomAugspurger
Member

The worker changes here should be non-controversial though.

I'll defer to others (cc @jhamman) on where best to put schedulers.

@scottyhq
Member Author

@TomAugspurger and @jhamman - My arguments for the user nodepool for now are:

  1. If I'm not mistaken, in our current dask-gateway setup the dask components are still very much tied to the user-notebook session (if I shut down my notebook server, the dask scheduler also disappears).

  2. I like putting anything that should scale to 0 outside of the core pool. This is a bit preemptive, but I'm afraid of situations where jupyterhub pods and scheduler pods get spread across many more nodes than necessary (see "Support for pod schedulers other than schedulerName: default-scheduler", dask/dask-kubernetes#233).

Ultimately I think we want to decouple the gateway from jupyterhub altogether, correct? That would allow connecting to dask clusters in multiple regions, etc., in which case we'll eventually want distinct nodegroups for schedulers and workers.

Seems this is the current scheduler pod config / resource requests:

    Args:
      dask-gateway-scheduler
      --adaptive-period
      3.0
      --idle-timeout
      0.0
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  2147483648
    Requests:
      cpu:     1
      memory:  2147483648

@jhamman
Member

jhamman commented Mar 20, 2020

My 2 cents...

  • The resources the scheduler pods need are distinct from the jupyter pods (we won't ever need a GPU for a scheduler pod), so we shouldn't put them together.
  • The potential for poor scheduling on the core pool is a real concern, and we don't want to run into a situation where core pods are overly spread out.
  • So, we should probably create a separate node pool for the schedulers. It should be tuned to support the type of resource requests our scheduler pods will make and have a similar spot/preemptible profile to the notebook pods (i.e. if your cluster puts notebooks on spot, it's probably okay to do the same for schedulers).

Ultimately I think we want to decouple the gateway from jupyterhub altogether, correct?

This is possible now but we will still have one gateway per hub. The nice thing about this architecture is that we can connect to gateways outside the k8s cluster that the jhub is in.

@TomAugspurger
Member

TomAugspurger commented Mar 20, 2020 via email

@scottyhq
Member Author

@scottyhq do you have thoughts on a dedicated node pool for schedulers?

Seems like a good approach to me. I suppose we need to change the current scheduler config to:

    scheduler:
      extraPodConfig:
        tolerations:
          - key: "k8s.dask.org/dedicated"
            operator: "Equal"
            value: "scheduler"
            effect: "NoSchedule"

Then we leave it up to each cluster to create a new nodegroup with this taint:
    Taints: k8s.dask.org/dedicated=scheduler:NoSchedule

If the nodegroup doesn't exist, the scheduler pods will go onto the (still untainted) core nodes.
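
On AWS, that nodegroup could be sketched with eksctl roughly as below (names, sizes, and instance type are hypothetical, and the exact taint syntax depends on the eksctl version):

    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    metadata:
      name: pangeo            # hypothetical cluster name
      region: us-west-2
    nodeGroups:
      - name: dask-scheduler
        instanceType: m5.large
        minSize: 0            # let the pool scale to zero when no clusters are running
        maxSize: 10
        labels:
          k8s.dask.org/dedicated: scheduler
        taints:
          k8s.dask.org/dedicated: "scheduler:NoSchedule"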

@TomAugspurger
Member

Yeah, that sounds about right to me. I should have time to add that scheduler pool for GCP deployments today.
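
For GCP, a roughly equivalent (illustrative) node pool creation might look like:

    # Hypothetical names and sizes; the key point is the taint matching the
    # scheduler toleration above. Add --preemptible for the spot-like profile
    # discussed earlier.
    gcloud container node-pools create dask-scheduler \
      --cluster=pangeo --zone=us-central1-b \
      --machine-type=n1-standard-2 \
      --enable-autoscaling --num-nodes=1 --min-nodes=0 --max-nodes=10 \
      --node-taints=k8s.dask.org/dedicated=scheduler:NoSchedule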

@scottyhq
Member Author

@TomAugspurger and @tjcrone I'm ready to merge this if that's okay. I think scheduler pods will still end up on core nodes, but we can fix that once dask/dask-kubernetes#164 is implemented. Sound good?

@TomAugspurger
Member

Yep looks good.
