Support the Dask operator KubeCluster #41
Conversation
# Depending on the cluster type (Cluster or SpecCluster),
# adapt should or shouldn't be awaited
adapt_response = self._cluster.adapt(**self.adapt_kwargs)
if iscoroutine(adapt_response):
    await adapt_response
This will block the event loop if it's not a coroutine. You'll need to use inspect.iscoroutinefunction, then run it in a worker thread if it is not (or check if it's a Cluster / SpecCluster). This seems like an implementation issue in Dask though?
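The suggested pattern could be sketched roughly like this (illustrative only; `cluster` and `adapt_kwargs` are stand-in names, not Prefect's actual attributes):

```python
import asyncio
import inspect

async def call_adapt(cluster, **adapt_kwargs):
    """Call cluster.adapt without ever blocking the event loop."""
    if inspect.iscoroutinefunction(cluster.adapt):
        # Async implementation: await it directly on the event loop.
        await cluster.adapt(**adapt_kwargs)
    else:
        # Sync implementation: run it in a worker thread so a
        # potentially IO-bound call does not block the loop.
        await asyncio.to_thread(cluster.adapt, **adapt_kwargs)
```

`asyncio.to_thread` (Python 3.9+) hands the blocking call to the default thread pool, which is the usual way to mix sync library calls into async code.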
I tried to use inspect.iscoroutinefunction on the adapt method itself and asyncio.isfuture on the return value, but this method is not declared async; instead it returns a future of the private _adapt, handled by the sync method. I was only able to make it work with inspect.iscoroutine.
Also, I am not sure I understand why we should run it in a worker thread if it's not async. I was trying to keep the same call as the original code here when used with the classic.KubeCluster or other cloudprovider.*Instance classes.
Sorry for my lack of experience with asyncio loops 🙃
If you're running on an event loop and have some code that performs IO it must either be async and awaited or sync and run in a worker thread. The event loop is designed to switch quickly between a bunch of different tasks. When you run sync code in the event loop, it cannot switch to other tasks. This "blocks" the loop and causes performance problems.
It looks like the sync utility checks if asynchronous is set and returns a coroutine (they call it a "future", but it should be a coroutine from looking at the code).
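A toy reproduction of why inspect.iscoroutine works here while inspect.iscoroutinefunction does not (FakeCluster below is a made-up stand-in, not Dask's code: adapt is a plain def that returns an unawaited coroutine in asynchronous mode, mimicking the sync utility's behavior):

```python
import asyncio
import inspect

class FakeCluster:
    """Stand-in mimicking Dask's sync/async duality; not real Dask code."""
    def __init__(self, asynchronous):
        self.asynchronous = asynchronous

    async def _adapt(self):
        return "adapted"

    def adapt(self):
        # Plain (non-async) method, so iscoroutinefunction(self.adapt)
        # is False regardless of the asynchronous flag.
        if self.asynchronous:
            return self._adapt()  # coroutine the caller must await
        return asyncio.run(self._adapt())  # simplified stand-in for sync()

async def maybe_await(result):
    # The check the PR ended up using: inspect the *return value*.
    return await result if inspect.iscoroutine(result) else result
```

So checking the return value with inspect.iscoroutine is the only check that distinguishes the two modes when the method itself is not declared async.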
# If used with the operator implementation of KubeCluster,
# the cluster is not automatically started
if self._cluster.status.value != "running":
    await self._cluster._start()
I'm not excited about reaching into a private method here. Is this required to get to an async method?
You're right, that does not seem like the intended way of starting a cluster, but I didn't find anything relevant in their documentation for use with the asynchronous arg. If we create the instance without it, this _start method is called. I asked in their forum here to better understand how to use the new implementation asynchronously.
Just some more context, in their __init__:
if not self.asynchronous:
self._loop_runner.start()
self.sync(self._start)
Yes, but Prefect forces the asynchronous arg to True. Should we allow passing False when used with operator.KubeCluster?
It's weird that they don't call _start() in the __aenter__ method when we enter the context above. This seems like an oversight. It feels like we should report this upstream and get some clarity on their intent here instead of introducing workarounds.
Sure! Let's wait and see if they answer in the thread I created on their Discourse.
If you don't hear anything back, it might be faster to submit a GitHub issue.
You're right. I'll try to figure out how they should handle the asynchronous start.
@madkinsz @ahuang11 Hello 👋
Yes that's fine with me!
This PR adds support for Dask's new operator.KubeCluster class. According to their documentation, this is the new preferred way to handle ephemeral clusters, since classic.KubeCluster won't be supported in upcoming releases. The new operator.KubeCluster class does not inherit from distributed.deploy.SpecCluster anymore, so the start method is not called directly during instantiation.
Closes PrefectHQ/prefect#12982
Example
With those changes, the following example works:
Checklist