Dask-gateway checklist #496

Closed
7 of 8 tasks
jhamman opened this issue Jan 9, 2020 · 39 comments

Comments

@jhamman
Member

jhamman commented Jan 9, 2020

We're in the process of testing and customizing dask-gateway. Here is a checklist (from #481) of to-do items that we can work on. We can use this issue to track progress on these, plus any additional configuration we see as necessary.

@TomAugspurger
Member

make sure dask clusters are putting pods on correct nodes (check taints/affinities/etc)

@jhamman which node pool do you expect the dask scheduler to land on? IMO, it makes sense for it to be in the jupyter pool (not preemptible), and the Dask workers to be in the dask-pool (preemptible).

I think right now we're adding the scheduler to the dask-pool, but it doesn't have the right toleration to run on preemptible nodes.

@TomAugspurger
Member

but it doesn't have the right toleration to run on preemptible nodes.

Nevermind, it does have the toleration. We're getting failures to start clusters because of a different issue (looking into it now).

May still be worth discussing where the scheduler is run though.

@jhamman
Member Author

jhamman commented Jan 23, 2020

I agree, the scheduler should be in the same pool as the jupyter notebook. We may need to just use the jupyter toleration here.
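
For concreteness, here is roughly what that separation looks like as a pod-spec fragment (a sketch only: the taint key shown is the GKE-style preemptible taint and "dask-pool" is the pool name used in this thread; where this dict gets attached, e.g. the Kubernetes backend's extra pod config for workers, depends on the dask-gateway/helm version):

# Sketch: toleration/selector Dask worker pods would need to land on the
# preemptible dask-pool; the scheduler pod would instead reuse the jupyter
# pool's toleration. The taint key is illustrative -- match whatever taint
# the preemptible pool actually carries.
preemptible_toleration = {
    "key": "cloud.google.com/gke-preemptible",
    "operator": "Equal",
    "value": "true",
    "effect": "NoSchedule",
}

worker_extra_pod_config = {
    "tolerations": [preemptible_toleration],
    "nodeSelector": {"cloud.google.com/gke-nodepool": "dask-pool"},
}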

@TomAugspurger
Member

@jcrist I'm looking into securing the dask gateways with TLS. I think the lack of HTTPS is preventing the clusters from working with the jupyterlab plugin (loading http content on an https webpage). I have a high-level question before I dig into it too far.

Our jupyterhubs are using zero-to-jupyterhub's auto-https. This creates an autohttps pod that, IIUC, handles all the Let's Encrypt stuff automatically.

Is it sensible / possible for the dask-gateway pods to reuse that setup to get the TLS certificates in place?

@TomAugspurger
Member

TLS is now working on staging.hub.pangeo.io, which fixed the dask-labextension.

[Screenshot: dask-labextension working on staging.hub.pangeo.io]

Thanks Jim!

@jhamman
Member Author

jhamman commented Feb 18, 2020

@TomAugspurger - FYI, I've marked all the public gateway IP addresses as static.

@scottyhq
Member

scottyhq commented Mar 17, 2020

@jhamman and @TomAugspurger - an initial test of this is not working on the AWS hub, probably related to getting https set up on the various Services created by dask-gateway. It might also be due to the relatively new autohttps setup on jupyterhub (#563).

Following this configuration
https://github.com/pangeo-data/pangeo-cloud-federation/pull/520/files

I see

NAME                                         TYPE           CLUSTER-IP       EXTERNAL-IP                                                               PORT(S)                      AGE
gateway-api-icesat2-prod-dask-gateway        ClusterIP      10.100.228.113   <none>                                                                    8001/TCP                     31d
hub                                          ClusterIP      10.100.88.80     <none>                                                                    8081/TCP                     347d
proxy-api                                    ClusterIP      10.100.43.198    <none>                                                                    8001/TCP                     347d
proxy-http                                   ClusterIP      10.100.135.9     <none>                                                                    8000/TCP                     347d
proxy-public                                 LoadBalancer   10.100.166.102   XXXX(PROXY-PUBLIC).us-west-2.elb.amazonaws.com   443:32434/TCP,80:30204/TCP   347d
scheduler-api-icesat2-prod-dask-gateway      ClusterIP      10.100.133.197   <none>                                                                    8001/TCP                     31d
scheduler-public-icesat2-prod-dask-gateway   LoadBalancer   10.100.199.78    XXXX(SCHEDULER-PUBLIC).us-west-2.elb.amazonaws.com    8786:31542/TCP               31d
web-api-icesat2-prod-dask-gateway            ClusterIP      10.100.42.60     <none>                                                                    8001/TCP                     31d
web-public-icesat2-prod-dask-gateway         LoadBalancer   10.100.192.199   XXXX(WEB-PUBLIC).us-west-2.elb.amazonaws.com   80:30816/TCP                 31d

And if I set:

gateway = Gateway(address='https://XXXX(WEB-PUBLIC).us-west-2.elb.amazonaws.com/services/dask-gateway',
                  proxy_address='tls://XXXX(SCHEDULER-PUBLIC).us-west-2.elb.amazonaws.com:8786',
                  auth='jupyterhub')

I get the following traceback running cluster = gateway.new_cluster()

---------------------------------------------------------------------------
HTTPTimeoutError                          Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.8/site-packages/dask_gateway/client.py in _fetch(self, req)
    345             self._cookie_jar.pre_request(req)
--> 346             resp = await client.fetch(req, raise_error=False)
    347             if resp.code == 401:
HTTPTimeoutError: Timeout while connecting
During handling of the above exception, another exception occurred:
TimeoutError                              Traceback (most recent call last)
<ipython-input-7-bf15802dc141> in <module>
----> 1 cluster = gateway.new_cluster()
/srv/conda/envs/notebook/lib/python3.8/site-packages/dask_gateway/client.py in new_cluster(self, cluster_options, shutdown_on_close, **kwargs)
    581         cluster : GatewayCluster
    582         """
--> 583         return GatewayCluster(
    584             address=self.address,
    585             proxy_address=self.proxy_address,
/srv/conda/envs/notebook/lib/python3.8/site-packages/dask_gateway/client.py in __init__(self, address, proxy_address, auth, cluster_options, shutdown_on_close, asynchronous, loop, **kwargs)
    751         **kwargs,
    752     )
--> 753         self._init_internal(
    754             address=address,
    755             proxy_address=proxy_address,
/srv/conda/envs/notebook/lib/python3.8/site-packages/dask_gateway/client.py in _init_internal(self, address, proxy_address, auth, cluster_options, cluster_kwargs, shutdown_on_close, asynchronous, loop, name)
    851             self.status = "starting"
    852         if not self.asynchronous:
--> 853             self.gateway.sync(self._start_internal)
    854 
    855     @property
/srv/conda/envs/notebook/lib/python3.8/site-packages/dask_gateway/client.py in sync(self, func, *args, **kwargs)
    310             )
    311             try:
--> 312                 return future.result()
    313             except BaseException:
    314                 future.cancel()
/srv/conda/envs/notebook/lib/python3.8/concurrent/futures/_base.py in result(self, timeout)
    437                 raise CancelledError()
    438             elif self._state == FINISHED:
--> 439                 return self.__get_result()
    440             else:
    441                 raise TimeoutError()
/srv/conda/envs/notebook/lib/python3.8/concurrent/futures/_base.py in __get_result(self)
    386     def __get_result(self):
    387         if self._exception:
--> 388             raise self._exception
    389         else:
    390             return self._result
/srv/conda/envs/notebook/lib/python3.8/site-packages/dask_gateway/client.py in _start_internal(self)
    865             self._start_task = asyncio.ensure_future(self._start_async())
    866         try:
--> 867             await self._start_task
    868         except BaseException:
    869             # On exception, cleanup
/srv/conda/envs/notebook/lib/python3.8/site-packages/dask_gateway/client.py in _start_async(self)
    878         if self.status == "created":
    879             self.status = "starting"
--> 880             self.name = await self.gateway._submit(
    881                 cluster_options=self._cluster_options, **self._cluster_kwargs
    882             )
/srv/conda/envs/notebook/lib/python3.8/site-packages/dask_gateway/client.py in _submit(self, cluster_options, **kwargs)
    475             headers=HTTPHeaders({"Content-type": "application/json"}),
    476         )
--> 477         resp = await self._fetch(req)
    478         data = json.loads(resp.body)
    479         return data["name"]
/srv/conda/envs/notebook/lib/python3.8/site-packages/dask_gateway/client.py in _fetch(self, req)
    370             # Tornado 6 still raises these above with raise_error=False
    371             if exc.code == 599:
--> 372                 raise TimeoutError("Request timed out")
    373             # Should never get here!
    374             raise
TimeoutError: Request timed out

kubectl logs scheduler-proxy-icesat2-prod-dask-gateway-6c584cd5b7-2v9mq -n icesat2-prod is reporting [W 2020-03-17 23:20:35.656 SchedulerProxy] Extracting SNI: Error reading TLS record header: EOF every couple of seconds. I suspect HTTPS isn't automatically being enabled for those external IPs, because dropping the 's' from https works and I'm able to access the dashboard URL.

Seems related to dask/dask-gateway#191, so pinging @jcrist and @yuvipanda

@jcrist
Member

jcrist commented Mar 17, 2020

I think you want:

# Use the jupyterhub proxy address above, not the dask-gateway proxy address
gateway = Gateway(address='XXXX(PROXY-PUBLIC).us-west-2.elb.amazonaws.com/services/dask-gateway',
                  proxy_address='tls://XXXX(SCHEDULER-PUBLIC).us-west-2.elb.amazonaws.com:8786',
                  auth='jupyterhub')

gateway.list_clusters()

Assuming you intend to route through the JupyterHub proxy.

@scottyhq
Member

scottyhq commented Mar 18, 2020

Thanks for the tip @jcrist - yes, that is what we're going for. I was confused as to which IP to use. Unfortunately, using either that ELB address or the JHub public address, I see two different tracebacks:

gateway = Gateway(address='XXXX(PROXY-PUBLIC).us-west-2.elb.amazonaws.com/services/dask-gateway',
                  proxy_address='tls://XXXX(SCHEDULER-PUBLIC).us-west-2.elb.amazonaws.com:8786',
                  auth='jupyterhub')
---------------------------------------------------------------------------
SSLError                                  Traceback (most recent call last)
<ipython-input-28-bf15802dc141> in <module>
----> 1 cluster = gateway.new_cluster()

/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in new_cluster(self, cluster_options, shutdown_on_close, **kwargs)
    589             cluster_options=cluster_options,
    590             shutdown_on_close=shutdown_on_close,
--> 591             **kwargs,
    592         )
    593 

/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in __init__(self, address, proxy_address, auth, cluster_options, shutdown_on_close, asynchronous, loop, **kwargs)
    759             shutdown_on_close=shutdown_on_close,
    760             asynchronous=asynchronous,
--> 761             loop=loop,
    762         )
    763 

/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in _init_internal(self, address, proxy_address, auth, cluster_options, cluster_kwargs, shutdown_on_close, asynchronous, loop, name)
    851             self.status = "starting"
    852         if not self.asynchronous:
--> 853             self.gateway.sync(self._start_internal)
    854 
    855     @property

/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in sync(self, func, *args, **kwargs)
    310             )
    311             try:
--> 312                 return future.result()
    313             except BaseException:
    314                 future.cancel()

/srv/conda/envs/pangeo/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    433                 raise CancelledError()
    434             elif self._state == FINISHED:
--> 435                 return self.__get_result()
    436             else:
    437                 raise TimeoutError()

/srv/conda/envs/pangeo/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in _start_internal(self)
    865             self._start_task = asyncio.ensure_future(self._start_async())
    866         try:
--> 867             await self._start_task
    868         except BaseException:
    869             # On exception, cleanup

/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in _start_async(self)
    879             self.status = "starting"
    880             self.name = await self.gateway._submit(
--> 881                 cluster_options=self._cluster_options, **self._cluster_kwargs
    882             )
    883         # Connect to cluster

/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in _submit(self, cluster_options, **kwargs)
    475             headers=HTTPHeaders({"Content-type": "application/json"}),
    476         )
--> 477         resp = await self._fetch(req)
    478         data = json.loads(resp.body)
    479         return data["name"]

/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in _fetch(self, req)
    344         try:
    345             self._cookie_jar.pre_request(req)
--> 346             resp = await client.fetch(req, raise_error=False)
    347             if resp.code == 401:
    348                 context = self.auth.pre_request(req, resp)

/srv/conda/envs/pangeo/lib/python3.7/site-packages/tornado/simple_httpclient.py in run(self)
    334                     ssl_options=ssl_options,
    335                     max_buffer_size=self.max_buffer_size,
--> 336                     source_ip=source_ip,
    337                 )
    338 

/srv/conda/envs/pangeo/lib/python3.7/site-packages/tornado/tcpclient.py in connect(self, host, port, af, ssl_options, max_buffer_size, source_ip, source_port, timeout)
    292             else:
    293                 stream = await stream.start_tls(
--> 294                     False, ssl_options=ssl_options, server_hostname=host
    295                 )
    296         return stream

/srv/conda/envs/pangeo/lib/python3.7/site-packages/tornado/iostream.py in _do_ssl_handshake(self)
   1415             self._handshake_reading = False
   1416             self._handshake_writing = False
-> 1417             self.socket.do_handshake()
   1418         except ssl.SSLError as err:
   1419             if err.args[0] == ssl.SSL_ERROR_WANT_READ:

/srv/conda/envs/pangeo/lib/python3.7/ssl.py in do_handshake(self, block)
   1137             if timeout == 0.0 and block:
   1138                 self.settimeout(None)
-> 1139             self._sslobj.do_handshake()
   1140         finally:
   1141             self.settimeout(timeout)

SSLError: [SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error (_ssl.c:1076)

OR using the hub login address

gateway = Gateway(address='https://aws-uswest2.pangeo.io/services/dask-gateway',
                  proxy_address='tls://XXXX(SCHEDULER-PUBLIC).us-west-2.elb.amazonaws.com:8786',
                  auth='jupyterhub')
---------------------------------------------------------------------------
HTTPClientError                           Traceback (most recent call last)
<ipython-input-26-bf15802dc141> in <module>
----> 1 cluster = gateway.new_cluster()

/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in new_cluster(self, cluster_options, shutdown_on_close, **kwargs)
    589             cluster_options=cluster_options,
    590             shutdown_on_close=shutdown_on_close,
--> 591             **kwargs,
    592         )
    593 

/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in __init__(self, address, proxy_address, auth, cluster_options, shutdown_on_close, asynchronous, loop, **kwargs)
    759             shutdown_on_close=shutdown_on_close,
    760             asynchronous=asynchronous,
--> 761             loop=loop,
    762         )
    763 

/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in _init_internal(self, address, proxy_address, auth, cluster_options, cluster_kwargs, shutdown_on_close, asynchronous, loop, name)
    851             self.status = "starting"
    852         if not self.asynchronous:
--> 853             self.gateway.sync(self._start_internal)
    854 
    855     @property

/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in sync(self, func, *args, **kwargs)
    310             )
    311             try:
--> 312                 return future.result()
    313             except BaseException:
    314                 future.cancel()

/srv/conda/envs/pangeo/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    433                 raise CancelledError()
    434             elif self._state == FINISHED:
--> 435                 return self.__get_result()
    436             else:
    437                 raise TimeoutError()

/srv/conda/envs/pangeo/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in _start_internal(self)
    865             self._start_task = asyncio.ensure_future(self._start_async())
    866         try:
--> 867             await self._start_task
    868         except BaseException:
    869             # On exception, cleanup

/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in _start_async(self)
    879             self.status = "starting"
    880             self.name = await self.gateway._submit(
--> 881                 cluster_options=self._cluster_options, **self._cluster_kwargs
    882             )
    883         # Connect to cluster

/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in _submit(self, cluster_options, **kwargs)
    475             headers=HTTPHeaders({"Content-type": "application/json"}),
    476         )
--> 477         resp = await self._fetch(req)
    478         data = json.loads(resp.body)
    479         return data["name"]

/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in _fetch(self, req)
    366                         raise GatewayServerError(msg)
    367                     else:
--> 368                         resp.rethrow()
    369         except HTTPError as exc:
    370             # Tornado 6 still raises these above with raise_error=False

/srv/conda/envs/pangeo/lib/python3.7/site-packages/tornado/httpclient.py in rethrow(self)
    675         """If there was an error on the request, raise an `HTTPError`."""
    676         if self.error:
--> 677             raise self.error
    678 
    679     def __repr__(self) -> str:

HTTPClientError: HTTP 503: Service Unavailable

@jcrist
Member

jcrist commented Mar 18, 2020

Oh wait, I might have misunderstood what the proxy-public service was. Is that not the jupyterhub proxy?

Try without routing through jupyterhub to see if the service is up:

gateway = Gateway(address='http://XXXX(WEB-PUBLIC).us-west-2.elb.amazonaws.com',
                  proxy_address='tls://XXXX(SCHEDULER-PUBLIC).us-west-2.elb.amazonaws.com:8786',
                  auth='jupyterhub')

gateway.list_clusters()

The address you pass to address should be a valid http(s) address that will eventually reach the gateway api server. Normally this is through the gateway's web-proxy (the web-public service above). If you're routing through jupyterhub's proxy, then this is jupyterhub's proxy address with the gateway's service path added (/services/dask-gateway/).

@tjcrone
Contributor

tjcrone commented Mar 18, 2020

We got ours working using the web-public-XXXX-prod-dask-gateway external IP address for the gateway, not the proxy-public (JHub) address. I have not set up a static address yet, so not using DNS. Probably not an issue for you, but it never hurts to cut out DNS when troubleshooting stuff like this. You can also use curl from inside the cluster to see who is up and serving what. Some hints on that here: dask/dask-gateway#191. I'm just learning dask-gateway but happy to keep thinking on this with you.
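
For example, a rough Python equivalent of that curl check, run from a notebook pod inside the cluster (the service name below is the icesat2-prod one from the listing above and is only illustrative):

import requests

# Any HTTP response (even a 404 or 401) proves the gateway web service answers;
# a connection error or timeout points at Service/proxy wiring instead.
url = "http://web-public-icesat2-prod-dask-gateway/services/dask-gateway/"
try:
    resp = requests.get(url, timeout=10)
    print(resp.status_code, resp.reason)
except requests.exceptions.RequestException as exc:
    print("not reachable:", exc)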

@scottyhq
Member

scottyhq commented Mar 18, 2020

Thanks @tjcrone and @jcrist. The proxying and network stuff is well outside my wheelhouse, so I feel like I'm grasping in the dark here!

gateway = Gateway(address='http://XXXX(WEB-PUBLIC).us-west-2.elb.amazonaws.com',
                  proxy_address='tls://XXXX(SCHEDULER-PUBLIC).us-west-2.elb.amazonaws.com:8786',
                  auth='jupyterhub')

gateway.list_clusters()

Gives this traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-0ecbfbe3dfc3> in <module>
      2                   proxy_address='tls://(REDACTED).us-west-2.elb.amazonaws.com:8786',
      3                   auth='jupyterhub')
----> 4 gateway.list_clusters()

/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in list_clusters(self, status, **kwargs)
    408         clusters : list of ClusterReport
    409         """
--> 410         return self.sync(self._clusters, status=status, **kwargs)
    411 
    412     def get_cluster(self, cluster_name, **kwargs):

/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in sync(self, func, *args, **kwargs)
    310             )
    311             try:
--> 312                 return future.result()
    313             except BaseException:
    314                 future.cancel()

/srv/conda/envs/pangeo/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    433                 raise CancelledError()
    434             elif self._state == FINISHED:
--> 435                 return self.__get_result()
    436             else:
    437                 raise TimeoutError()

/srv/conda/envs/pangeo/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in _clusters(self, status)
    387         url = "%s/gateway/api/clusters/%s" % (self.address, query)
    388         req = HTTPRequest(url=url)
--> 389         resp = await self._fetch(req)
    390         return [
    391             ClusterReport._from_json(self._public_address, self.proxy_address, r)

/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in _fetch(self, req)
    360 
    361                     if resp.code in {404, 422}:
--> 362                         raise ValueError(msg)
    363                     elif resp.code == 409:
    364                         raise GatewayClusterError(msg)

ValueError: Not Found

but interestingly, adding /services/dask-gateway to the XXXX(WEB-PUBLIC) address succeeds and returns an empty list!

gateway = Gateway(address='http://XXXX(WEB-PUBLIC).us-west-2.elb.amazonaws.com/services/dask-gateway/',
                  proxy_address='tls://XXXX(SCHEDULER-PUBLIC).us-west-2.elb.amazonaws.com:8786',
                  auth='jupyterhub')

gateway.list_clusters()

@tjcrone
Contributor

tjcrone commented Mar 18, 2020

This is exactly how things went for me, and it only started working once you told me to add /services/dask-gateway/ to the address. Getting an empty list here is good! What happens when you try to start, scale up, and connect to a cluster:

cluster = gateway.new_cluster()
cluster.scale(8)
client = cluster.get_client()
client

@scottyhq
Member

Yep @tjcrone, it works! ... but I'm not really following the changes in #520. My understanding was that without the https:// mapping to the jupyterhub proxy, the dask labextension can't connect to a cluster started via dask-gateway?

I'm confused about enabling https in general for the dask-gateway LoadBalancers. In jupyterhub there is a more explicit mapping:

  jupyterhub:
    proxy:
      https:
        hosts:
          - aws-uswest2.pangeo.io
        letsencrypt:
          contactEmail: scottyh@uw.edu
      service:
        loadBalancerIP: XXXX(PROXY-PUBLIC).us-west-2.elb.amazonaws.com

versus:

tlsURL: |
  if isinstance(c.DaskGateway.public_connect_url, str):
      c.DaskGateway.public_connect_url += "/services/dask-gateway"

@tjcrone
Contributor

tjcrone commented Mar 18, 2020

@scottyhq, I'm still working on getting the Dask labextension working (and the dashboard), and making sure that TLS is working all the way through. I'll keep you posted as I learn more.

@scottyhq
Member

One more thing I noticed looking at pod logs, in case it's helpful: the hub log is full of Unexpected error connecting to web-public-dev-staging-dask-gateway:80 messages.

kubectl logs hub-7844c8f9cb-2jrhf -n icesat2-prod

[I 2020-03-18 01:12:14.804 JupyterHub proxy:320] Checking routes
[E 2020-03-18 01:12:14.916 JupyterHub utils:75] Unexpected error connecting to web-public-dev-staging-dask-gateway:80 [Errno -2] Name or service not known
[E 2020-03-18 01:12:14.988 JupyterHub utils:75] Unexpected error connecting to web-public-dev-staging-dask-gateway:80 [Errno -2] Name or service not known
[E 2020-03-18 01:12:15.304 JupyterHub utils:75] Unexpected error connecting to web-public-dev-staging-dask-gateway:80 [Errno -2] Name or service not known
[E 2020-03-18 01:12:15.804 JupyterHub utils:75] Unexpected error connecting to web-public-dev-staging-dask-gateway:80 [Errno -2] Name or service not known
[W 2020-03-18 01:12:15.804 JupyterHub app:1903] Cannot connect to external service dask-gateway at http://web-public-dev-staging-dask-gateway

@tjcrone
Contributor

tjcrone commented Mar 18, 2020

Same thing on our cluster.

@tjcrone
Contributor

tjcrone commented Mar 18, 2020

@scottyhq, I am also getting the 503 Service Unavailable when trying to start dask-gateway through proxy-public. And when I try to start a cluster using web-public-ooi-prod-dask-gateway, I need to add /services/dask-gateway, which is interesting. This doesn't seem to align with the suggestions from @jcrist or @TomAugspurger, or the docs for that matter. So I'm not sure what is going on.

@jcrist
Member

jcrist commented Mar 18, 2020

This doesn't seem to align with the suggestions from @jcrist or @TomAugspurger, or the docs for that matter. So I'm not sure what is going on.

Apologies, I forgot about this bit.

The configuration here is to run dask-gateway as a JupyterHub service, which involves proxying requests through JupyterHub's proxy (and can thus rely on JupyterHub for HTTPS support). So a request looks like:

user -> jupyterhub-proxy (HTTPS) -> dask-gateway-proxy (HTTP) -> dask-gateway-server (HTTP)

Services are proxied on /services/{service-name} - in this case this is /services/dask-gateway/. Because JupyterHub's proxy doesn't support stripping prefixes, requests arrive to the dask-gateway-proxy with the /services/dask-gateway/ prefix (e.g. what would normally be a /api/clusters/ call is now a /services/dask-gateway/api/clusters/ call). To support this, dask-gateway has been configured to run with the same prefix prepended for all routes (this is the bit I forgot about when making suggestions above). So you do need the /services/dask-gateway/ path whether you're connecting through the JupyterHub proxy or directly through the dask-gateway proxy.
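
Concretely (reusing the redacted hostnames from earlier in this thread, and assuming the gateway is registered as a JupyterHub service so the hub-proxy route works at all), both ways of connecting need that prefix:

from dask_gateway import Gateway

# Through the JupyterHub proxy (JupyterHub terminates HTTPS):
gateway = Gateway(address='https://XXXX(PROXY-PUBLIC).us-west-2.elb.amazonaws.com/services/dask-gateway/',
                  proxy_address='tls://XXXX(SCHEDULER-PUBLIC).us-west-2.elb.amazonaws.com:8786',
                  auth='jupyterhub')

# Or directly through the dask-gateway web proxy -- same prefix, plain HTTP here:
gateway = Gateway(address='http://XXXX(WEB-PUBLIC).us-west-2.elb.amazonaws.com/services/dask-gateway/',
                  proxy_address='tls://XXXX(SCHEDULER-PUBLIC).us-west-2.elb.amazonaws.com:8786',
                  auth='jupyterhub')

gateway.list_clusters()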

From the above, it looks like you're successfully connecting and running when handled directly, but unsuccessful when connecting through the JupyterHub proxy. Is the service configured properly? Is your configuration available somewhere so I can help debug?

One more thing I noticed looking at pod logs in case it's helpful is that the hub is loaded with Unexpected error connecting to web-public-dev-staging-dask-gateway:80 messages

I assume this happens at startup, but not later in the logs? When JupyterHub is starting up it tries to connect to all registered services (it also performs these checks periodically). If dask-gateway is also starting up at the same time, it may fail these checks initially (since it isn't running yet). Shouldn't be anything to worry about.

@tjcrone
Contributor

tjcrone commented Mar 18, 2020

@jcrist, thank you for this helpful clarification. The bulk of our configuration options are here: https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/pangeo-deploy/values.yaml. The per-deployment configs are, as an example, here: https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/deployments/ooi/config/common.yaml. It looks to me like in our deployment (ooi) we are not specifying that the service start, as @TomAugspurger did here:

url: http://web-public-dev-staging-dask-gateway

@jcrist
Member

jcrist commented Mar 18, 2020

Ah, yeah, without the url dask-gateway won't be proxied behind the jupyterhub proxy.

We might be able to automate that configuration with some helm magic so that each pangeo deployment doesn't need to manually add that url. For now though that's likely http://web-public-ooi-staging-dask-gateway or something like that.

@jhamman
Member Author

jhamman commented Mar 18, 2020

See also pangeo-data/helm-chart#126

@tjcrone
Contributor

tjcrone commented Mar 18, 2020

Okay, that seemed to work @jcrist! I'm not sure why, but our version of dask-gateway appears to be 0.3.0, at least in my image; that was a problem I solved temporarily with a conda install. But I still cannot access the dashboard; now I'm getting a 500 Internal Server Error.

@tjcrone
Contributor

tjcrone commented Mar 18, 2020

Oh I see, the cluster object provides the right dashboard url, but the client object does not. Getting closer!!

@tjcrone
Contributor

tjcrone commented Mar 18, 2020

Making tons of progress thanks to help from all y'alls. Thank you very much! Here's a comment in my recent PR that probably belongs in this thread: #568 (comment)

@scottyhq
Member

Confirmed this is now working on AWS with

gateway = Gateway(address='https://staging.aws-uswest2.pangeo.io/services/dask-gateway',
                  proxy_address='tls://scheduler-public-icesat2-staging-dask-gateway:8786',
                  auth='jupyterhub')

@scottyhq
Member

@jhamman and @TomAugspurger. Is it possible to connect to the gateway in GKE from aws-uswest2.pangeo.io with the current setup? What address, proxy, and auth would need to be provided in that case?

@TomAugspurger
Member

Good question @scottyhq! By default, things don't quite work as easily as

gateway = Gateway(address='https://staging.aws-uswest2.pangeo.io/services/dask-gateway',
                  proxy_address='tls://scheduler-public-icesat2-staging-dask-gateway:8786',
                  auth='jupyterhub')

because the two jupyterhubs have their own API tokens. My GKE API token stored at JUPYTERHUB_API_TOKEN isn't valid for the aws-uswest2 hub. However, if I manually create an API token on the AWS cluster at https://staging.aws-uswest2.pangeo.io/hub/token and use it explicitly:

from dask_gateway import Gateway
from dask_gateway.auth import JupyterHubAuth

auth = JupyterHubAuth(api_token="<my-token>")
gateway = Gateway(address='https://staging.aws-uswest2.pangeo.io/services/dask-gateway',
                  proxy_address='tls://scheduler-public-icesat2-staging-dask-gateway:8786',
                  auth=auth)

Then things work! It'd be great if there were a way to "sync" API tokens between two hubs, since we're both using GitHub for auth, but I don't know if that's possible.

@TomAugspurger
Member

Well, things kinda work. I'm able to connect to the us-west gateway, but I can't create a cluster.

cluster = gateway.new_cluster()
---------------------------------------------------------------------------
GatewayClusterError                       Traceback (most recent call last)
<ipython-input-21-a899aa24fb70> in <module>
----> 1 cluster = gateway.new_cluster()
      2 cluster

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in new_cluster(self, cluster_options, shutdown_on_close, **kwargs)
    589             cluster_options=cluster_options,
    590             shutdown_on_close=shutdown_on_close,
--> 591             **kwargs,
    592         )
    593 

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in __init__(self, address, proxy_address, auth, cluster_options, shutdown_on_close, asynchronous, loop, **kwargs)
    759             shutdown_on_close=shutdown_on_close,
    760             asynchronous=asynchronous,
--> 761             loop=loop,
    762         )
    763 

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _init_internal(self, address, proxy_address, auth, cluster_options, cluster_kwargs, shutdown_on_close, asynchronous, loop, name)
    851             self.status = "starting"
    852         if not self.asynchronous:
--> 853             self.gateway.sync(self._start_internal)
    854 
    855     @property

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in sync(self, func, *args, **kwargs)
    310             )
    311             try:
--> 312                 return future.result()
    313             except BaseException:
    314                 future.cancel()

/srv/conda/envs/notebook/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    433                 raise CancelledError()
    434             elif self._state == FINISHED:
--> 435                 return self.__get_result()
    436             else:
    437                 raise TimeoutError()

/srv/conda/envs/notebook/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _start_internal(self)
    865             self._start_task = asyncio.ensure_future(self._start_async())
    866         try:
--> 867             await self._start_task
    868         except BaseException:
    869             # On exception, cleanup

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _start_async(self)
    883         # Connect to cluster
    884         try:
--> 885             report = await self.gateway._wait_for_start(self.name)
    886         except GatewayClusterError:
    887             raise

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _wait_for_start(self, cluster_name)
    524                     raise GatewayClusterError(
    525                         "Cluster %r failed to start, see logs for "
--> 526                         "more information" % cluster_name
    527                     )
    528                 elif report.status is ClusterStatus.STOPPED:

GatewayClusterError: Cluster '545e13e9207d4dada507bf4a58adce79' failed to start, see logs for more information

I don't have access to the logs.

@jcrist
Member

jcrist commented Mar 24, 2020

If things are set up to use your notebook image by default, could it be that the image isn't available on the other install (e.g. it's in Amazon's/Google's private image registry)?

We could set up multiple gateways for a single hub instance (and I have some ideas for how to do it for multiple hubs per gateway as well), but the current setup is really optimized for the 1:1 case. Manually getting a token does work, but isn't ideal.

@scottyhq
Member

scottyhq commented Mar 24, 2020

Thanks for looking into this @TomAugspurger and @jcrist. I think this ability to connect to various clusters from a single hub would be extremely powerful, but it does also introduce more complications. In particular, it opens up the possibility of users streaming a lot of data between cloud providers and incurring huge egress costs. So for now, sticking with the 1:1 option seems best. In the future though, is there a way to set per-user network transfer limits in a similar way to CPU/RAM limits?

Then things work! It'd be great if there were a way to "sync" API tokens between two hubs, since we're both using GitHub for auth, but I don't know if that's possible.

At least for hubs managed in this repo, we can use the same API tokens in a shared secrets file, right? Security-wise that's probably not ideal, but maybe we can rotate them periodically.

If things are set up to use your notebook image by default, could it be that the image isn't available on the other instal (e.g. using Amazon's/Google's image registry).

Good point! This is another argument for storing all our images on DockerHub. In our case I think we don't really need private registries for these jupyterhub images.

@TomAugspurger
Member

At least for hubs managed in this repo, we can use the same API tokens in a shared secrets file, right? Security wise that's probably not ideal, but maybe we can rotate them periodically.

I wondered about that. I think that the API key in question is randomly generated by the Hub, not by us. But it'd be good to verify that.

I think Jim is right that the image is likely the culprit. If I have time later today I'll try the other direction (connecting to the GKE hub from AWS), since I have access to the GKE logs.

@tjcrone
Contributor

tjcrone commented Mar 27, 2020

Great progress lately on Dask-gateway. This will be a welcome improvement!

One thing I'm wondering about is a user's choice of worker image. I was under the impression, perhaps mistakenly, that Dask-gateway would allow us to control which images are run on worker nodes. Is this the case? Or instead, does the architecture of Dask-gateway reduce the risk of allowing users to deploy arbitrary images in the unlikely case that they would be used malevolently? I see that recent changes define the worker image in a user-space environment variable, so any help in understanding this would be greatly appreciated. Thanks!

@TomAugspurger
Member

I was under the impression, perhaps mistakenly, that Dask-gateway would allow us to control which images are run on worker nodes. Is this the case?

You're correct. Currently we provide users the option to pick an arbitrary image, and set the default to '{JUPYTER_IMAGE_SPEC}', the image their hub singleuser instance is using.

extraEnv:
  # The default worker image matches the singleuser image.
  DASK_GATEWAY__CLUSTER__OPTIONS__IMAGE: '{JUPYTER_IMAGE_SPEC}'

You can override the option_handler to provide a whitelist of images to pick from.
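
A minimal sketch of what that override could look like on the server side, assuming dask-gateway's options machinery (Options/Select). The image tags are placeholders, and the config object the options attach to differs across dask-gateway versions:

from dask_gateway_server.options import Options, Select

ALLOWED_IMAGES = [
    "pangeo/pangeo-notebook:latest",  # placeholder tags, for illustration only
    "pangeo/base-notebook:latest",
]

def options_handler(options):
    # Map the user's selection onto the cluster's image setting.
    return {"image": options.image}

# On newer dask-gateway releases this is c.Backend.cluster_options; older
# releases expose the same idea under a different config class.
c.Backend.cluster_options = Options(
    Select("image", ALLOWED_IMAGES, default=ALLOWED_IMAGES[0], label="Worker image"),
    handler=options_handler,
)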

@TomAugspurger
Member

TomAugspurger commented Mar 27, 2020

I'm able to use Dask-Gateway to connect to & create clusters from staging dev, ocean, and hydro. With #576, things should be ready on prod as well (after a staging -> prod merge).

The version of distributed in hydro is a bit too old to connect a client to it directly; it would require GatewayCluster.get_client() instead (see the sketch below). Is hydro still in use?
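
A minimal sketch of that workaround, following the same pattern used earlier in this thread:

from dask_gateway import Gateway

gateway = Gateway()
cluster = gateway.new_cluster()
client = cluster.get_client()  # instead of dask.distributed.Client(cluster)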

@TomAugspurger
Member

Everything seems to be working well on prod. I think we're good here.

@scottyhq
Member

scottyhq commented Apr 3, 2020

@jhamman @TomAugspurger - While running some additional tests, I'm realizing one drawback of putting the schedulers on a separate nodegroup that scales to zero (#569): it can take a long time for the new_cluster() command to complete. The main issue is that there is no feedback; it just seems like the kernel is hanging for 5 minutes:

%%time
from dask_gateway import Gateway
from dask.distributed import Client, progress

gateway = Gateway()
cluster = gateway.new_cluster()

CPU times: user 76.3 ms, sys: 30.7 ms, total: 107 ms
Wall time: 5min 29s

@TomAugspurger
Member

TomAugspurger commented Apr 4, 2020 via email

@scottyhq
Member

scottyhq commented May 5, 2020

closed by #577

@scottyhq closed this as completed May 5, 2020