Kernel restarts and "Bad Gateway" nuisance pop-ups #1170

Closed

jameshowison opened this issue Apr 5, 2022 · 9 comments
@jameshowison

Context

On the utexas hub I'm having some nuisance issues causing pop-ups about kernel restarts. This was one of the issues discussed in #1021, but we thought it was just xsql-related. I haven't used xsql for weeks and it's still happening.

It primarily happens on startup, where I get the pop-up shown in #1021 (comment). I think this is related to files that were open when the hub suspended. Perhaps that is why it is hard to replicate: you need to have "open" files with running kernels?

Before the Bad Gateway message I see something like "Kernel is not running, select kernel", but whichever kernel I select from the drop-down, the dialog pops up a few more times before disappearing.

I'll try to capture it in action with screenshots or a screencast.

As I say, this is more of a nuisance than a big issue, but it's not a great look when demoing.

Proposal

No response

Updates and actions

No response

@jameshowison
Author

Here's a screen recording of what I'm dealing with. This typically happens on first login. I keep selecting things until it goes away. I think that means the selection eventually takes effect, so it does seem like some sort of race condition.

in.mov

@jameshowison
Author

This might still be related to xeus. Here are the logs from around when this was happening (you can also look on the server for the full logs).

[I 2022-04-07 15:06:44.440 SingleUserNotebookApp log:189] 200 GET /user/jameshowison/lab/api/settings/@jupyterlab/application-extension:context-menu?1649344004359 (jameshowison@10.128.0.47) 14.72ms
Exception in callback <TaskStepMethWrapper object at 0x7f04916970d0>()
handle: <Handle <TaskStepMethWrapper object at 0x7f04916970d0>()>
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
RuntimeError: Cannot enter into task <Task pending coro=<MappingKernelManager.start_kernel() running at /srv/conda/envs/notebook/lib/python3.7/site-packages/notebook/services/kernels/kernelmanager.py:158> cb=[IOLoop.add_future.<locals>.<lambda>() at /srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py:688]> while another task <Task pending coro=<MappingKernelManager.start_kernel() running at /srv/conda/envs/notebook/lib/python3.7/site-packages/notebook/services/kernels/kernelmanager.py:176> cb=[IOLoop.add_future.<locals>.<lambda>() at /srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py:688]> is being executed.
Exception in callback <TaskStepMethWrapper object at 0x7f04916c29d0>()
handle: <Handle <TaskStepMethWrapper object at 0x7f04916c29d0>()>
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
RuntimeError: Cannot enter into task <Task pending coro=<MappingKernelManager.start_kernel() running at /srv/conda/envs/notebook/lib/python3.7/site-packages/notebook/services/kernels/kernelmanager.py:158> cb=[IOLoop.add_future.<locals>.<lambda>() at /srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py:688]> while another task <Task pending coro=<MappingKernelManager.start_kernel() running at /srv/conda/envs/notebook/lib/python3.7/site-packages/notebook/services/kernels/kernelmanager.py:176> cb=[IOLoop.add_future.<locals>.<lambda>() at /srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py:688]> is being executed.
Exception in callback <TaskStepMethWrapper object at 0x7f04916fced0>()
handle: <Handle <TaskStepMethWrapper object at 0x7f04916fced0>()>
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
RuntimeError: Cannot enter into task <Task pending coro=<MappingKernelManager.start_kernel() running at /srv/conda/envs/notebook/lib/python3.7/site-packages/notebook/services/kernels/kernelmanager.py:158> cb=[IOLoop.add_future.<locals>.<lambda>() at /srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py:688]> while another task <Task pending coro=<MappingKernelManager.start_kernel() running at /srv/conda/envs/notebook/lib/python3.7/site-packages/notebook/services/kernels/kernelmanager.py:176> cb=[IOLoop.add_future.<locals>.<lambda>() at /srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py:688]> is being executed.
Exception in callback <TaskStepMethWrapper object at 0x7f04916fc890>()
handle: <Handle <TaskStepMethWrapper object at 0x7f04916fc890>()>
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
RuntimeError: Cannot enter into task <Task pending coro=<HTTP1ServerConnection._server_request_loop() running at /srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/http1connection.py:815> cb=[IOLoop.add_future.<locals>.<lambda>() at /srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py:688]> while another task <Task pending coro=<MappingKernelManager.start_kernel() running at /srv/conda/envs/notebook/lib/python3.7/site-packages/notebook/services/kernels/kernelmanager.py:176> cb=[IOLoop.add_future.<locals>.<lambda>() at /srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py:688]> is being executed.
Exception in callback <TaskStepMethWrapper object at 0x7f049185c610>()
handle: <Handle <TaskStepMethWrapper object at 0x7f049185c610>()>
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
RuntimeError: Cannot enter into task <Task pending coro=<HTTP1ServerConnection._server_request_loop() running at /srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/http1connection.py:815> cb=[IOLoop.add_future.<locals>.<lambda>() at /srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py:688]> while another task <Task pending coro=<MappingKernelManager.start_kernel() running at /srv/conda/envs/notebook/lib/python3.7/site-packages/notebook/services/kernels/kernelmanager.py:176> cb=[IOLoop.add_future.<locals>.<lambda>() at /srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py:688]> is being executed.
Exception in callback <TaskStepMethWrapper object at 0x7f0492355e10>()
handle: <Handle <TaskStepMethWrapper object at 0x7f0492355e10>()>
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
RuntimeError: Cannot enter into task <Task pending coro=<HTTP1ServerConnection._server_request_loop() running at /srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/http1connection.py:815> cb=[IOLoop.add_future.<locals>.<lambda>() at /srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py:688]> while another task <Task pending coro=<MappingKernelManager.start_kernel() running at /srv/conda/envs/notebook/lib/python3.7/site-packages/notebook/services/kernels/kernelmanager.py:176> cb=[IOLoop.add_future.<locals>.<lambda>() at /srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py:688]> is being executed.
registering handler for SIGSEGV
Starting xeus-sql kernel...

If you want to connect to this kernel from an other client, you can use the /tmp/.jupyter-runtime/kernel-3920c266-3b07-48d1-8540-778f692a26e7.json file.
Run with XEUS 2.3.1
[I 2022-04-07 15:06:44.505 SingleUserNotebookApp kernelmanager:179] Kernel started: 3920c266-3b07-48d1-8540-778f692a26e7, name: xsql
[I 2022-04-07 15:06:44.505 SingleUserNotebookApp kernelmanager:443] Culling kernels with idle durations > 3600 seconds at 300 second intervals ...
[I 2022-04-07 15:06:44.505 SingleUserNotebookApp kernelmanager:447] Culling kernels even with connected clients
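
For reference, the last two culling lines just reflect the standard kernel-culling settings being logged at startup. A rough sketch of the corresponding jupyter_notebook_config.py traitlets (the hub sets these through its deployment config, so this is illustrative, not the actual config file):

# Sketch only: these are the MappingKernelManager traitlets behind the culling log lines.
c = get_config()  # provided by Jupyter when the config file is loaded

c.MappingKernelManager.cull_idle_timeout = 3600  # cull kernels idle for more than 3600 s
c.MappingKernelManager.cull_interval = 300       # check for idle kernels every 300 s
c.MappingKernelManager.cull_connected = True     # cull even if a client is still connected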

@yuvipanda
Member

So I poked around a bit, and was able to repro this fairly easily.

There were no 504 errors in the notebook log, but I do see them in the nginx log and in the proxy process's log. In the proxy log, I can see:

19:59:17.015 [ConfigProxy] error: 503 POST /user/yuvipanda/api/sessions socket hang up

And that correlates with the entries from the nginx log:

10.128.0.53 - - [08/Apr/2022:19:59:17 +0000] "POST /user/yuvipanda/api/sessions?1649447896873 HTTP/2.0" 504 160 "https://utexas.pilot.2i2c.cloud/user/yuvipanda/lab?" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:98.0) Gecko/20100101 Firefox/98.0" 217 60.001 [utexas-proxy-public-http] [] 10.0.2.216:8000 0 60.000 504 aec4099cfed8d8f96da9616a350456ed

So it looks like the /user/&lt;username&gt;/api/sessions endpoint is sometimes taking so long that everything times out - you can see it took 60.001 seconds before it timed out. So most likely something is blocking the server process there, and I'm not sure what that is.
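
One quick way to confirm that the sessions endpoint is the slow call is to time it directly. A rough sketch (the hub URL, username, and token handling here are placeholders, not what I actually ran):

import os
import time

import requests

# Placeholders: substitute the real hub URL, username, and a valid API token.
hub_url = "https://utexas.pilot.2i2c.cloud"
username = "yuvipanda"
token = os.environ["JUPYTERHUB_API_TOKEN"]

start = time.monotonic()
resp = requests.get(
    f"{hub_url}/user/{username}/api/sessions",
    headers={"Authorization": f"token {token}"},
    timeout=120,  # longer than the 60 s proxy timeout, so we see the real duration
)
print(resp.status_code, f"{time.monotonic() - start:.1f}s")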

I don't see the xeus-sql log lines you mentioned yet.

@yuvipanda
Member

Aha, maybe jupyter/notebook#6164 is what you are experiencing?

@yuvipanda
Member

jupyter/notebook#6221 has been merged, which should address this, but the current image is on notebook 6.3.0. I've upgraded notebook to the latest release in 2i2c-org/utexas-image#7 - let's see if that fixes it.
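
Once the new image is deployed, a trivial check from a terminal or notebook on the hub confirms the upgrade actually landed (just a sketch, not part of the fix itself):

# Confirm the notebook package actually got bumped on the new image.
import notebook
print(notebook.__version__)  # should report something newer than 6.3.0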

yuvipanda added a commit to yuvipanda/repo2docker that referenced this issue Apr 8, 2022
- We should stop auto updating 3.6 now, so we can move on to
  newer notebook packages - a new notebook version ought to fix
  issues like 2i2c-org/infrastructure#1170
  for example.
- No longer explicitly pin nbconvert, as whatever version notebook
  brings in should be sufficient.
- Bump up ipywidgets and jupyterlab while we're at it.
- No longer pin jupyterhub-singleuser either, as that too works
  properly with 3.7
@yuvipanda
Member

OK, so I deployed the newly built image to the hub, and it looks like it fixed the annoying pop-ups! You also got RStudio for free :)

Let me know if this recurs, @jameshowison! jupyterhub/repo2docker#1149 should also fix this longer term for other hubs.

@jameshowison
Author

Great. So far so good. I'll let you know if we see any recurrences, but this sure seems well diagnosed. And I'm stoked that this helps with other images/upstream :)

yuvipanda added a commit to yuvipanda/repo2docker that referenced this issue Apr 11, 2022
@yuvipanda
Member

@choldgraf tweeted about this: https://twitter.com/choldgraf/status/1513612334815535106

@jameshowison
Author

Love it!
