
telnet console randomly stops responding #1335

Closed

PatrickBostwick opened this issue May 21, 2018 · 33 comments

Comments

@PatrickBostwick

The telnet console randomly stops responding, and a reboot of the device is required to get it to respond again. Networking between devices still seems to work when this happens, and I can even telnet or SSH into the device whose console no longer works. The log output is below. I am running Ubuntu 18.04 x86_64 with 32 GB of RAM.

2018-05-20 13:55:07 ERROR base_events.py:1266 Fatal read error on socket transport
protocol: <asyncio.streams.StreamReaderProtocol object at 0x7f08c580a4e0>
transport: <_SelectorSocketTransport fd=43 read=polling write=<polling, bufsize=2493>>
Traceback (most recent call last):
File "/usr/lib/python3.6/asyncio/selector_events.py", line 723, in _read_ready
data = self._sock.recv(self.max_size)
TimeoutError: [Errno 110] Connection timed out
2018-05-20 13:55:07 ERROR base_events.py:1266 Fatal read error on socket transport
protocol: <asyncio.streams.StreamReaderProtocol object at 0x7f08c695b6a0>
transport: <_SelectorSocketTransport fd=48 read=polling write=<polling, bufsize=2493>>
Traceback (most recent call last):
File "/usr/lib/python3.6/asyncio/selector_events.py", line 723, in _read_ready
data = self._sock.recv(self.max_size)
TimeoutError: [Errno 110] Connection timed out
2018-05-20 13:55:07 ERROR base_events.py:1266 Task exception was never retrieved
future: <Task finished coro=<AsyncioTelnetServer.run() done, defined at /usr/share/gns3/gns3-server/lib/python3.6/site-packages/gns3server/utils/asyncio/telnet_server.py:188> exception=TimeoutError(110, 'Connection timed out')>
Traceback (most recent call last):
File "/usr/share/gns3/gns3-server/lib/python3.6/site-packages/gns3server/utils/asyncio/telnet_server.py", line 197, in run
yield from self._process(network_reader, network_writer, connection)
File "/usr/share/gns3/gns3-server/lib/python3.6/site-packages/gns3server/utils/asyncio/telnet_server.py", line 259, in _process
data = coro.result()
File "/usr/lib/python3.6/asyncio/streams.py", line 634, in read
yield from self._wait_for_data('read')
File "/usr/lib/python3.6/asyncio/streams.py", line 464, in _wait_for_data
yield from self._waiter
File "/usr/lib/python3.6/asyncio/selector_events.py", line 723, in _read_ready
data = self._sock.recv(self.max_size)
TimeoutError: [Errno 110] Connection timed out

@CapnCheapo
Contributor

What version of QEMU are you running? I had this problem when we upgraded to a recent version but the problem went away when we went back to 2.5.0.

@PatrickBostwick
Author

Hmmm, it looks like I'm using 2.11.1

dpkg --list | grep qemu

ii ipxe-qemu 1.0.0+git-20180124.fbe8c52d-0ubuntu2 all PXE boot firmware - ROM images for qemu
ii ipxe-qemu-256k-compat-efi-roms 1.0.0+git-20150424.a25a16d-0ubuntu2 all PXE boot firmware - Compat EFI ROM images for qemu
ii qemu 1:2.11+dfsg-1ubuntu7.1 amd64 fast processor emulator
ii qemu-block-extra:amd64 1:2.11+dfsg-1ubuntu7.1 amd64 extra block backend modules for qemu-system and qemu-utils
ii qemu-kvm 1:2.11+dfsg-1ubuntu7.1 amd64 QEMU Full virtualization on x86 hardware
ii qemu-slof 20170724+dfsg-1ubuntu1 all Slimline Open Firmware -- QEMU PowerPC version
ii qemu-system 1:2.11+dfsg-1ubuntu7.1 amd64 QEMU full system emulation binaries
ii qemu-system-arm 1:2.11+dfsg-1ubuntu7.1 amd64 QEMU full system emulation binaries (arm)
ii qemu-system-common 1:2.11+dfsg-1ubuntu7.1 amd64 QEMU full system emulation binaries (common files)
ii qemu-system-mips 1:2.11+dfsg-1ubuntu7.1 amd64 QEMU full system emulation binaries (mips)
ii qemu-system-misc 1:2.11+dfsg-1ubuntu7.1 amd64 QEMU full system emulation binaries (miscellaneous)
ii qemu-system-ppc 1:2.11+dfsg-1ubuntu7.1 amd64 QEMU full system emulation binaries (ppc)
ii qemu-system-s390x 1:2.11+dfsg-1ubuntu7.1 amd64 QEMU full system emulation binaries (s390x)
ii qemu-system-sparc 1:2.11+dfsg-1ubuntu7.1 amd64 QEMU full system emulation binaries (sparc)
ii qemu-system-x86 1:2.11+dfsg-1ubuntu7.1 amd64 QEMU full system emulation binaries (x86)
ii qemu-user 1:2.11+dfsg-1ubuntu7.1 amd64 QEMU user mode emulation binaries
ii qemu-user-binfmt 1:2.11+dfsg-1ubuntu7.1 amd64 QEMU user mode binfmt registration for qemu-user
ii qemu-utils 1:2.11+dfsg-1ubuntu7.1 amd64 QEMU utilities

qemu-x86_64 --version

qemu-x86_64 version 2.11.1(Debian 1:2.11+dfsg-1ubuntu7.1)
Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers

I'll look into downgrading to 2.5. Thanks!

@grossmj grossmj closed this as completed Jun 11, 2018
@PatrickBostwick
Author

PatrickBostwick commented Jun 11, 2018 via email

@spikefishjohn
Contributor

Hi all, I'm getting this same error with Ubuntu 16.04 and KVM 2.5. I'm trying to gather more info.

ii ipxe-qemu 1.0.0+git-20150424.a25a16d-1ubuntu1.2 all PXE boot firmware - ROM images for qemu
ii qemu-block-extra:amd64 1:2.5+dfsg-5ubuntu10.33 amd64 extra block backend modules for qemu-system and qemu-utils
ii qemu-kvm 1:2.5+dfsg-5ubuntu10.33 amd64 QEMU Full virtualization
ii qemu-slof 20151103+dfsg-1ubuntu1.1 all Slimline Open Firmware -- QEMU PowerPC version
ii qemu-system 1:2.5+dfsg-5ubuntu10.33 amd64 QEMU full system emulation binaries
ii qemu-system-arm 1:2.5+dfsg-5ubuntu10.33 amd64 QEMU full system emulation binaries (arm)
ii qemu-system-common 1:2.5+dfsg-5ubuntu10.33 amd64 QEMU full system emulation binaries (common files)
ii qemu-system-mips 1:2.5+dfsg-5ubuntu10.33 amd64 QEMU full system emulation binaries (mips)
ii qemu-system-misc 1:2.5+dfsg-5ubuntu10.33 amd64 QEMU full system emulation binaries (miscelaneous)
ii qemu-system-ppc 1:2.5+dfsg-5ubuntu10.33 amd64 QEMU full system emulation binaries (ppc)
ii qemu-system-sparc 1:2.5+dfsg-5ubuntu10.33 amd64 QEMU full system emulation binaries (sparc)
ii qemu-system-x86 1:2.5+dfsg-5ubuntu10.33 amd64 QEMU full system emulation binaries (x86)
ii qemu-utils 1:2.5+dfsg-5ubuntu10.33 amd64 QEMU utilities
root@compute01:/var/log#

What is strange is that I can see via tcpdump that PuTTY has connected to the Python process, but no data is showing up. VNC consoles don't seem to have this problem. I think I also noticed too many IRQs on a serial port on a different VM. I can't log in to that VM remotely yet, but I want to start adding to this.

@grossmj grossmj reopened this Dec 10, 2018
@grossmj grossmj added this to the 2.2 milestone Dec 10, 2018
@grossmj grossmj added the Bug label Dec 10, 2018
@spikefishjohn
Contributor

I've been playing around with tcpdump and strace. I verified I can see the telnet session going into Python. I ran strace against Python and noticed I can see data coming back from the remote console.

egrep '6758 recvfrom\(18' output.txt

and I see

6758 recvfrom(18, "\n\33[H\33[JThis system is for authorized use only.\r\nlogin: ", 262144, 0, NULL, NULL) = 55
6758 recvfrom(18, "T", 262144, 0, NULL, NULL) = 1
6758 recvfrom(18, "H", 262144, 0, NULL, NULL) = 1
6758 recvfrom(18, "I", 262144, 0, NULL, NULL) = 1
6758 recvfrom(18, "S", 262144, 0, NULL, NULL) = 1
6758 recvfrom(18, " ", 262144, 0, NULL, NULL) = 1
6758 recvfrom(18, "I", 262144, 0, NULL, NULL) = 1
6758 recvfrom(18, "S", 262144, 0, NULL, NULL) = 1
6758 recvfrom(18, " ", 262144, 0, NULL, NULL) = 1
6758 recvfrom(18, "A", 262144, 0, NULL, NULL) = 1
6758 recvfrom(18, " ", 262144, 0, NULL, NULL) = 1
6758 recvfrom(18, "TE", 262144, 0, NULL, NULL) = 2
6758 recvfrom(18, "S", 262144, 0, NULL, NULL) = 1
6758 recvfrom(18, "T", 262144, 0, NULL, NULL) = 1
6758 recvfrom(18, "\r\n", 262144, 0, NULL, NULL) = 2
6758 recvfrom(18, "PASS", 262144, 0, NULL, NULL) = 4
6758 recvfrom(18, "WORD: ", 262144, 0, NULL, NULL) = 6

So this is the login prompt for the remote box.

I don't really see much going on between the recvfrom calls.

6758 recvfrom(18, "Logi", 262144, 0, NULL, NULL) = 4
6758 clock_gettime(CLOCK_MONOTONIC, {330709, 205034534}) = 0
6758 clock_gettime(CLOCK_MONOTONIC, {330709, 205115342}) = 0
6758 epoll_wait(3, [{EPOLLIN, {u32=18, u64=9721862949299027986}}], 69, 46) = 1
6758 clock_gettime(CLOCK_MONOTONIC, {330709, 205276121}) = 0
6758 recvfrom(18, "n incor", 262144, 0, NULL, NULL) = 7
6758 clock_gettime(CLOCK_MONOTONIC, {330709, 205457783}) = 0
6758 clock_gettime(CLOCK_MONOTONIC, {330709, 205543689}) = 0
6758 epoll_wait(3, [{EPOLLIN, {u32=18, u64=9721862949299027986}}], 69, 45) = 1
6758 clock_gettime(CLOCK_MONOTONIC, {330709, 205702582}) = 0

I'm not sure how to proceed; my Unix troubleshooting skills are about at an end.

@spikefishjohn
Contributor

Oh, and I SSHed directly into the host. Nothing odd is showing up in dmesg or /var/log/*.

@spikefishjohn
Contributor

spikefishjohn commented Dec 10, 2018

BTW, I'm running GNS3 2.1.11 on bare-metal Ubuntu 16.04 with kernel 4.15.0-42-generic.

@spikefishjohn
Contributor

Well... I got brave. I looked up gdb debugging of Python, attached gdb to gns3server, ran a bt (backtrace), exited out, and the console started working.
:/

@spikefishjohn
Contributor

I didn't get any soft-lockup messages in dmesg. Also, if KVM were the issue, I would expect that strace-ing (is that a verb?) gns3server wouldn't have shown the login console and all that jazz. I feel like the issue, or at least the one I'm reporting, is on the gns3server side. Could be wrong. I also don't understand why pausing gns3server under gdb would have fixed it somehow; I mean, if KVM were going bonkers, how could Python be the Advil? I'm going to upgrade to 18.04 and see what happens. BTW, my gns3server host is a Dell R620 with an E5-2660. I'll report back on how that goes.

@grossmj
Member

grossmj commented Dec 14, 2018

I don't think kvm is the problem either. Have you tried different Qemu VMs or a different terminal?

@spikefishjohn
Contributor

spikefishjohn commented Dec 14, 2018 via email

@spikefishjohn
Contributor

Oh, and to be clear, I had two VMs built from the same image attached to the same network. One worked, the other didn't. Now they both seem to be fine. Odd.

@spikefishjohn
Contributor

Seems like it hasn't come back since upgrading. I'll update the bug report if it pops up again.

@grossmj
Member

grossmj commented Dec 21, 2018

Ok, please post here if it happens again. Thanks.

@grossmj grossmj closed this as completed Dec 21, 2018
@blargh2015

blargh2015 commented Dec 27, 2018

We seem to be having this identical issue on our install. Ubuntu 16.04.3, QEMU emulator version 2.5.0 (Debian 1:2.5+dfsg-5ubuntu10.33), Copyright (c) 2003-2008 Fabrice Bellard, gns3server version 2.1.0.

Reproduction: It's not immediate, and a restart of the emulated router does clear the problem for a few hours, but it always comes back, and once it starts happening it keeps happening (and thus the console is unusable) until the router VM is restarted. When the problem occurs, I telnet to the host running the VM (my session is socket 18 in the trace below); gns3server relays my input to the VM successfully via socket 37 and gets a response back on socket 37, but never sends that response back out socket 18 to my client:

An strace -p shows that when I hit Enter in the telnet session, the keystroke is relayed, and a "Login:" prompt comes back from the router VM but is never relayed on by the gns3server process:

[pid 2702] epoll_wait(3, [{EPOLLIN, {u32=18, u64=6583065115254325266}}], 29, 100) = 1
[pid 2702] recvfrom(18, "\r", 262144, 0, NULL, NULL) = 1
[pid 2702] epoll_wait(3, [], 29, 0) = 0
[pid 2702] epoll_wait(3, [], 29, 0) = 0
[pid 2702] epoll_wait(3, [], 29, 0) = 0
[pid 2702] sendto(37, "\r", 1, 0, NULL, 0) = 1
[pid 2702] epoll_wait(3, [], 29, 0) = 0
[pid 2702] epoll_wait(3, [{EPOLLIN, {u32=37, u64=6583065115254325285}}], 29, 51) = 1
[pid 2702] recvfrom(37, "\r\r", 262144, 0, NULL, NULL) = 2
[pid 2702] epoll_wait(3, [{EPOLLIN, {u32=37, u64=6583065115254325285}}], 29, 51) = 1
[pid 2702] recvfrom(37, "\nLo", 262144, 0, NULL, NULL) = 3
[pid 2702] epoll_wait(3, [{EPOLLIN, {u32=37, u64=6583065115254325285}}], 29, 51) = 1
[pid 2702] recvfrom(37, "gin", 262144, 0, NULL, NULL) = 3
[pid 2702] epoll_wait(3, [{EPOLLIN, {u32=37, u64=6583065115254325285}}], 29, 50) = 1
[pid 2702] recvfrom(37, ": ", 262144, 0, NULL, NULL) = 2
[pid 2702] epoll_wait(3, [], 29, 50) = 0
[pid 2702] epoll_wait(3, [], 29, 0) = 0

lsof of sockets 18 and 37:
$ sudo lsof -i -a -p 2702 | grep 18u
gns3serve 2702 gns3 18u IPv4 1587885 0t0 TCP (the server):5000->(my laptop):58995 (ESTABLISHED)

$ sudo lsof -i -a -p 2702 | grep 37u
gns3serve 2702 gns3 37u IPv4 1211956 0t0 TCP localhost:53070->localhost:5001 (ESTABLISHED)

The matching local port 5001 binding:
$ sudo lsof -n | grep ":5001 (LISTEN)"
qemu-syst 30932 gns3 21u IPv4 1195504 0t0 TCP 127.0.0.1:5001 (LISTEN)
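
To make the two file descriptors in the trace concrete: socket 18 is the laptop-to-gns3server telnet session and socket 37 is the gns3server-to-QEMU console connection, and the server just pumps bytes in both directions between them. A bare-bones sketch of that relay shape, not the gns3server implementation (handler and helper names are invented; the ports are taken from the lsof output above):

import asyncio


async def pump(src: asyncio.StreamReader, dst: asyncio.StreamWriter) -> None:
    # Copy bytes one direction until EOF or a socket error.
    while True:
        data = await src.read(4096)
        if not data:
            break
        dst.write(data)
        await dst.drain()


async def handle_telnet_client(client_reader: asyncio.StreamReader,
                               client_writer: asyncio.StreamWriter) -> None:
    # Connect to the VM's serial console (QEMU listens on localhost:5001 in the lsof output).
    vm_reader, vm_writer = await asyncio.open_connection("127.0.0.1", 5001)
    try:
        to_vm = asyncio.create_task(pump(client_reader, vm_writer))      # keystrokes: socket 18 -> 37
        to_client = asyncio.create_task(pump(vm_reader, client_writer))  # console output: 37 -> 18
        done, pending = await asyncio.wait({to_vm, to_client},
                                           return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()
    finally:
        vm_writer.close()
        client_writer.close()


async def main() -> None:
    server = await asyncio.start_server(handle_telnet_client, "0.0.0.0", 5000)
    async with server:
        await server.serve_forever()


if __name__ == "__main__":
    asyncio.run(main())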

@blargh2015

For what it's worth, we have a sneaking suspicion (nothing to prove it yet) that the issue may be related to a firewall between my laptop and the server that may be removing TCP sessions from its mapping table after a couple of hours. The server is still showing additional connections open to port 5000, but those telnet sessions are long gone:

gns3serve 2702 gns3 11u IPv4 1009532 0t0 TCP (server):5000->(other laptop):53368 (ESTABLISHED)
gns3serve 2702 gns3 13u IPv4 992572 0t0 TCP (server):5000->(other laptop):52640 (ESTABLISHED)
gns3serve 2702 gns3 18u IPv4 1587885 0t0 TCP (server):5000->(my laptop):58995 (ESTABLISHED)

I left socket 18 (my telnet session) open overnight. This morning the telnet client still acted connected until I hit Enter, at which point I almost immediately got "Connection closed by foreign host."

However, AFTER that, socket 18 is STILL showing ESTABLISHED on the server, with output identical to the above.

Are TCP keepalives enabled on the TCP socket between the Telnet connection sharing proxy on GNS3 and the various clients (the laptop->GNS3 session)? It doesn't look like it:
$ sudo netstat -to | grep 5000
tcp 0 0 (server):5000 (other laptop):53368 ESTABLISHED off (0.00/0/0)
tcp 0 0 (server):5000 (my laptop):58995 ESTABLISHED off (0.00/0/0)
tcp 0 0 (server):5000 (other laptop):52640 ESTABLISHED off (0.00/0/0)
(the "off" part would be 'keepalive' if it was enabled)

I'm not familiar enough yet with asyncio to figure out how to set it in that library, although with Python's low level socket call it'd be
socket_.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

That would help at least clean up the hanging sockets. I'm not sure what chaos this may be causing inside gns3. I'm rooting through telnet_server.py and base_node.py trying to sort through it, but work keeps pulling me away.
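
Picking up the asyncio question in the comment above: a StreamWriter hands back the transport's underlying socket via get_extra_info("socket"), so the same low-level setsockopt call can be applied to each accepted connection. A minimal sketch of that pattern, not the actual gns3server code (the helper name is invented; it would be called from a connection handler like the relay sketch earlier):

import asyncio
import socket


def enable_keepalive(writer: asyncio.StreamWriter) -> None:
    # The transport exposes the raw socket.socket object it wraps.
    sock = writer.get_extra_info("socket")
    if sock is not None:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)   # probe idle peers
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)   # don't buffer small writes

With SO_KEEPALIVE set on the client-facing socket, a session silently dropped by a middlebox eventually fails its probes and errors out of the read, instead of sitting in ESTABLISHED forever like the stuck sockets in the netstat output above.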

@grossmj
Member

grossmj commented Jan 12, 2019

Thanks for troubleshooting this. I have added the socket options SO_KEEPALIVE and TCP_NODELAY which hopefully should help/fix this issue.

@blargh2015

Thanks for troubleshooting this. I have added the socket options SO_KEEPALIVE and TCP_NODELAY which hopefully should help/fix this issue.

Thanks - I snagged the patch and applied it to one of our compute nodes. I should hopefully know in a day or so if this was the fix or not.

@grossmj
Member

grossmj commented Jan 14, 2019

Excellent, that would be awesome to confirm if the patch fixes this issue.

@grossmj
Member

grossmj commented Jan 17, 2019

@blargh2015 have you had a chance to test? :)

@blargh2015

So far it looks good - our test VMs on one of our compute nodes continue to work through today. Various CLI commands also indicate the hanging sockets aren't sticking around anymore.

I'm going to apply it to all of our compute nodes later today and make sure our other engineers are hammering it.

tl;dr: Looking good so far - let me spread it wider and let it soak another couple days before the 'all clear'.

@blargh2015

So, a couple of days in, things look good - it seems like that patch nailed the problem. I think we're OK to close the issue.

@grossmj
Member

grossmj commented Jan 19, 2019

Thanks for your help 👍

@grossmj grossmj closed this as completed Jan 19, 2019
@VarunDelft

Hi, I am facing the same issue (the same socket transport error in the server log) with a fresh GNS3 server 2.2.5 install on a bare-metal server, Ubuntu 18.04.4 LTS and QEMU 2.11.
How can I resolve this? I gather that my installation already has the above-mentioned patch.

@spikefishjohn
Contributor

I'm still seeing it as well with 2.2.3.

VarunDelft - bb284d0

That is the link to the patch. It should already be in 2.2.5.

My question is that I don't understand how this fixes anything. I'm by no means a Python expert, but it looks like a socket object has been instantiated without being used? I'm not following how this sets the socket options on the asyncio connection.

@VarunDelft

Yes, the patch is there in 2.2.5. I checked the content of telnet_server.py in my installation. It looks like the patch only forces a keepalive on the TCP socket connection, preventing it from timing out? Anyway, that doesn't seem to solve the problem completely.

@spikefishjohn
Contributor

Oh, I just re-read the Python code and I get it now: socket is a reference to the writer object that was passed in.

@spikefishjohn
Contributor

I was reading, and it looks like maybe TCP keepalives don't start firing off for 2 hours, and this isn't configurable via a socket call on Linux. I think 1 hour is what most cool kids are configuring as the timeout on stateful firewalls.

https://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html

I'm using that to lower the timers and will see if that helps things at all.
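
For reference, the defaults that HOWTO lowers are net.ipv4.tcp_keepalive_time = 7200 s (the two-hour delay mentioned above), net.ipv4.tcp_keepalive_intvl = 75 s and net.ipv4.tcp_keepalive_probes = 9. A sketch of a per-connection alternative, assuming a Linux host, where Python's socket module also exposes the Linux-only constants TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT that override those sysctls for a single socket (the helper name and the 15-minute/30-second/5-probe values are made up):

import socket


def tune_keepalive(sock: socket.socket, idle: int = 900, interval: int = 30, probes: int = 5) -> None:
    # Start probing after `idle` seconds of silence, every `interval` seconds,
    # and drop the connection after `probes` consecutive failures.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):  # Linux-specific constants
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

Either route - the system-wide sysctls from the HOWTO or per-socket options like these - keeps an idle console session well under a one-hour firewall state timeout.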

@VarunDelft

VarunDelft commented Apr 21, 2020

@spikefishjohn Thanks for sharing the link. I have decreased the time to 3600 s in my environment now. Let's see how it goes. What are your observations?
On a different note, the link also says:

"Programs must request keepalive control for their sockets using the setsockopt interface"

Isn't this what the patch is trying to do as well?

@spikefishjohn
Contributor

So far I haven't seen any other issues since lowering the timers per the usingkeepalive.html link. 3600 might be too close to the default TCP timeout; I would do 30 minutes, or maybe even 15 - basically anything below any stateful firewall's session timer.

The patch makes it so keepalives CAN be sent, but the rate at which they are sent isn't configured at the socket level, only at the OS level.

@VarunDelft

In my case, I'm still observing the issue on a few devices, so I'll probably need to try reducing the time interval and see if it helps.

@e-cnelis

e-cnelis commented May 7, 2023

Hi, I have the same issue on GNS3 2.2.35.1. Decreasing the TCP keepalive timers didn't help. Do you have any advice on how to get rid of this problem?

@spikefishjohn
Contributor

@e-cnelis @VarunDelft @blargh2015 @PatrickBostwick I know this is a super old bug, but I think the real fix for this is now in the latest release of GNS3. Patch here
