
Waiting for tunxxx to become free #18

Closed · bernhardschmidt opened this issue Jan 5, 2023 · 31 comments
Labels: bug Something isn't working

@bernhardschmidt

As far as I understand this is already known?

When stopping an OpenVPN instance, the kernel log fills with messages like


Jan  5 10:14:12 eduvpn-n09 kernel: [94308.032155] unregister_netdevice: waiting for tun321 to become free. Usage count = 1
Jan  5 10:14:12 eduvpn-n09 kernel: [94308.416149] unregister_netdevice: waiting for tun263 to become free. Usage count = 2
@ordex
Member

ordex commented Jan 5, 2023

Thanks for the report! It's probably the same as #17.
However, it'd be great to have a way to reproduce it, because the message is just a generic indication that the interface is still referenced by "something".
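For context, the "Usage count" in that message is the netdevice reference count. A minimal sketch of the mechanism, assuming a 5.x-era kernel (simplified; upstream actually uses a per-CPU counter):

#include <linux/netdevice.h>

/* Every kernel object that needs the device to stay around takes a
 * reference: sockets bound to it, cached routes, queued skbs, and in
 * this case ovpn-dco peers. */
static void take_ref(struct net_device *dev)
{
	dev_hold(dev);	/* "Usage count" goes up */
}

static void drop_ref(struct net_device *dev)
{
	dev_put(dev);	/* "Usage count" goes down */
}

/* unregister_netdevice() cannot complete until the count drops to
 * zero; a reference that is never put keeps the warning printing
 * forever. */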

@bernhardschmidt
Author

Just had an instance of #16, where all processes died:

Jan  9 11:43:47 eduvpn-n09 openvpn[601]: dco_do_read: netlink out of memory error: No buffer space available (errno=105)
Jan  9 11:43:47 eduvpn-n09 openvpn[607]: dco_do_read: netlink out of memory error: No buffer space available (errno=105)
Jan  9 11:43:47 eduvpn-n09 openvpn[606]: dco_do_read: netlink out of memory error: No buffer space available (errno=105)
Jan  9 11:43:47 eduvpn-n09 openvpn[608]: dco_do_read: netlink out of memory error: No buffer space available (errno=105)
Jan  9 11:43:47 eduvpn-n09 openvpn[600]: dco_do_read: netlink out of memory error: No buffer space available (errno=105)
Jan  9 11:43:47 eduvpn-n09 openvpn[603]: dco_do_read: netlink out of memory error: No buffer space available (errno=105)
Jan  9 11:43:47 eduvpn-n09 openvpn[602]: dco_do_read: netlink out of memory error: No buffer space available (errno=105)
Jan  9 11:43:47 eduvpn-n09 openvpn[604]: dco_do_read: netlink out of memory error: No buffer space available (errno=105)
Jan  9 11:43:47 eduvpn-n09 openvpn[605]: dco_do_read: netlink out of memory error: No buffer space available (errno=105)
Jan  9 11:43:47 eduvpn-n09 openvpn[599]: dco_do_read: netlink out of memory error: No buffer space available (errno=105)

Of the 10 running processes, 8 were restarted by systemd just fine. Two of them are in an uninterruptible sleep (D), and those are the two processes for which the warning is logged.

@bernhardschmidt
Author

These are the last messages of a broken server:

Jan  9 11:43:47 eduvpn-n09 openvpn[600]: dco_do_read: netlink out of memory error: No buffer space available (errno=105)
Jan  9 11:43:47 eduvpn-n09 openvpn[600]: Exiting due to fatal error
Jan  9 11:43:47 eduvpn-n09 openvpn[600]: Closing DCO interface
Jan  9 11:43:47 eduvpn-n09 openvpn[600]: net_addr_v4_del: 129.187.98.129 dev tun321
Jan  9 11:43:47 eduvpn-n09 openvpn[600]: net_addr_v6_del: 2001:4ca0:2fff:2:3:1:1:1/112 dev tun321
Jan  9 11:43:47 eduvpn-n09 openvpn[600]: net_iface_del: delete tun321

ordex added the bug label Jan 15, 2023
@ordex
Member

ordex commented Mar 24, 2023

It's very possible that this issue was solved by the latest changes we implemented, which reworked the control-packet flow and moved it away from netlink.

Any chance you could try to reproduce this one again with openvpn-2.6.2 and the latest master for ovpn-dco?

Thanks a lot

@bernhardschmidt
Author

This is still happening with 2.6.2 and the new DCO:

[287938.551622] unregister_netdevice: waiting for tun321 to become free. Usage count = 23
[287938.839851] unregister_netdevice: waiting for tun263 to become free. Usage count = 64

This happened on openvpn stop; the processes are in D state and unkillable:

nobody       602  0.2  2.8 115712 112232 ?       Ds   Mar28  12:23 /usr/sbin/openvpn --status /run/openvpn-server/status-tum-full-node-03-1.log --status-version 2 --suppress-timestamps --config tum-full-node-03-1.conf
nobody       610  0.2  1.9  79528 76144 ?        Ds   Mar28  12:43 /usr/sbin/openvpn --status /run/openvpn-server/status-tum-split-node-03-7.log --status-version 2 --suppress-timestamps --config tum-split-node-03-7.conf

@ordex
Member

ordex commented Mar 31, 2023

@bernhardschmidt thanks for reporting. The process is probably stuck on the iface-delete call, which won't return because the networking stack is busy waiting for tunxxx to be released, and that is probably never going to happen. (A sketch of the kernel loop it is stuck in follows below.)
If you don't stop OpenVPN, the hang doesn't manifest, because there is no pending call to get stuck.

This is due to some refcounting imbalance...

Did it happen with TCP or UDP?
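To illustrate why the process ends up in D state: the teardown path loops until the reference count drops to zero. An abbreviated sketch of netdev_wait_allrefs() from net/core/dev.c (5.x kernels; the notifier-rebroadcast part is omitted):

static void netdev_wait_allrefs(struct net_device *dev)
{
	unsigned long warning_time = jiffies;
	int refcnt = netdev_refcnt_read(dev);

	while (refcnt != 0) {
		msleep(250);
		refcnt = netdev_refcnt_read(dev);

		/* this is the message from the reports above */
		if (time_after(jiffies, warning_time + 10 * HZ)) {
			pr_emerg("unregister_netdevice: waiting for %s to become free. Usage count = %d\n",
				 dev->name, refcnt);
			warning_time = jiffies;
		}
	}
}

/* If a reference leaked, refcnt never reaches 0, this loop never
 * exits, and the process driving the teardown hangs unkillably. */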

ordex self-assigned this Mar 31, 2023
@bernhardschmidt
Author

Good catch: the two hanging processes (out of 10) are the ones listening on TCP.

@hildeb

hildeb commented Apr 25, 2023

I'm also encountering the issue with the hanging interface.
A TCP instance of 2.6.3 was being stopped:

...
Apr 25 08:36:10 openvpn-gw162-int openvpn-tcp[1623]: dco_parse_peer_multi: cannot store DCO stats for peer 32
Apr 25 08:36:13 openvpn-gw162-int openvpn-tcp[1623]: Closing DCO interface
Apr 25 08:36:13 openvpn-gw162-int openvpn-tcp[1623]: net_addr_v4_del: 172.29.136.1 dev tun1
Apr 25 08:36:13 openvpn-gw162-int openvpn-tcp[1623]: net_iface_del: delete tun1

After that, systemd complained:

Apr 25 08:37:40 openvpn-gw162-int systemd[1]: openvpn-server@openvpn-tcp.service: State 'stop-sigterm' timed out. Killing.
Apr 25 08:37:40 openvpn-gw162-int systemd[1]: openvpn-server@openvpn-tcp.service: Killing process 1623 (openvpn) with signal SIGKILL.
Apr 25 08:39:10 openvpn-gw162-int systemd[1]: openvpn-server@openvpn-tcp.service: Processes still around after SIGKILL. Ignoring.
Apr 25 08:40:40 openvpn-gw162-int systemd[1]: openvpn-server@openvpn-tcp.service: State 'stop-final-sigterm' timed out. Killing.
Apr 25 08:40:40 openvpn-gw162-int systemd[1]: openvpn-server@openvpn-tcp.service: Killing process 1623 (openvpn) with signal SIGKILL.
Apr 25 08:42:10 openvpn-gw162-int systemd[1]: openvpn-server@openvpn-tcp.service: Processes still around after final SIGKILL. Entering failed mode.
Apr 25 08:42:10 openvpn-gw162-int systemd[1]: openvpn-server@openvpn-tcp.service: Failed with result 'timeout'.
Apr 25 08:42:10 openvpn-gw162-int systemd[1]: openvpn-server@openvpn-tcp.service: Found left-over process 1623 (openvpn) in control group while starting unit. Ignoring.
Apr 25 08:42:11 openvpn-gw162-int systemd[1]: openvpn-server@openvpn-tcp.service: Found left-over process 1623 (openvpn) in control group while starting unit. Ignoring.

Subsequently, I wasn't able to start it again, since:

Apr 25 08:42:11 openvpn-gw162-int openvpn-tcp[69606]: MANAGEMENT: Socket bind failed on local address [AF_INET]127.0.0.1:7506: Address already in use (errno=98)

Looking at kern.log, I found:

Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.124095] tun1: deleting peer with id 23, reason 1
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.124118] tun1: deleting peer with id 22, reason 1
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.124678] BUG: kernel NULL pointer dereference, address: 000000000000021c
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.124872] #PF: supervisor read access in kernel mode
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.124963] #PF: error_code(0x0000) - not-present page
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.125058] PGD 2b4d0067 P4D 2b4d0067 PUD 2b381067 PMD 0 
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.125164] Oops: 0000 [#1] SMP NOPTI
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.125271] CPU: 0 PID: 53946 Comm: kworker/0:1 Tainted: G        W  OE     5.15.0-58-generic #64~20.04.1-Ubuntu
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.125586] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.125906] Workqueue: ovpn-crypto-wq-tun1 ovpn_decrypt_work [ovpn_dco_v2]
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.126090] RIP: 0010:ovpn_decrypt_work+0x243/0x610 [ovpn_dco_v2]
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.126279] Code: 48 8b 35 80 b6 1f ea 89 c0 48 8d bb f8 05 00 00 48 01 c6 e8 cf 4d 56 e8 48 8b 83 10 04 00 00 48 8b 80 00 01 00 00 48 8b 40 18 <66> 83 b8 1c 02 00 00 11 0f 84 fe 00 00 00 41 8b 44 24 70 f0 48 01
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.126934] RSP: 0018:ffffb7a280c43e20 EFLAGS: 00010286
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.127181] RAX: 0000000000000000 RBX: ffff9ece590a1880 RCX: 00000000000000d4
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.127443] RDX: ffffb7a280c43dd8 RSI: 0000000000000286 RDI: ffff9ecf3bc207c0
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.127695] RBP: ffffb7a280c43e58 R08: 0000000000000000 R09: 0000000000000000
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.127959] R10: ffff9ece2a6c0840 R11: 0000000000000010 R12: ffff9ecf18b46b00
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.128236] R13: ffff9ece590a19c8 R14: ffff9ece044def90 R15: 0000000000000000
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.128506] FS:  0000000000000000(0000) GS:ffff9ecf3bc00000(0000) knlGS:0000000000000000
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.128786] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.129078] CR2: 000000000000021c CR3: 00000000254ba004 CR4: 00000000007706f0
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.129401] PKRU: 55555554
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.129709] Call Trace:
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.130009]  <TASK>
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.130311]  process_one_work+0x228/0x3d0
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.130643]  worker_thread+0x4d/0x3f0
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.130969]  ? process_one_work+0x3d0/0x3d0
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.131305]  kthread+0x127/0x150
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.131639]  ? set_kthread_struct+0x50/0x50
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.131984]  ret_from_fork+0x1f/0x30
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.132384]  </TASK>
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.132719] Modules linked in: sctp ovpn_dco_v2(OE) ip6_udp_tunnel udp_tunnel tcp_diag inet_diag nfnetlink_log vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua nft_reject_ipv4 nft_counter binfmt_misc crct10dif_pclmul vmw_balloon crc32_pclmul ghash_clmulni_intel nf_log_syslog ahci vmw_vmci joydev vmxnet3 libahci nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_log nft_ct nf_tables nfnetlink sch_fq_codel msr nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 efi_pstore ip_tables x_tables autofs4 raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 multipath linear vmwgfx ttm input_leds drm_kms_helper aesni_intel syscopyarea sysfillrect mptspi sysimgblt scsi_transport_spi fb_sys_fops cec crypto_simd mptscsih rc_core psmouse cryptd serio_raw drm mptbase
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.136954] CR2: 000000000000021c
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.137397] ---[ end trace a021367094290bf0 ]---
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.137835] RIP: 0010:ovpn_decrypt_work+0x243/0x610 [ovpn_dco_v2]
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.138269] Code: 48 8b 35 80 b6 1f ea 89 c0 48 8d bb f8 05 00 00 48 01 c6 e8 cf 4d 56 e8 48 8b 83 10 04 00 00 48 8b 80 00 01 00 00 48 8b 40 18 <66> 83 b8 1c 02 00 00 11 0f 84 fe 00 00 00 41 8b 44 24 70 f0 48 01
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.139553] RSP: 0018:ffffb7a280c43e20 EFLAGS: 00010286
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.139972] RAX: 0000000000000000 RBX: ffff9ece590a1880 RCX: 00000000000000d4
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.140388] RDX: ffffb7a280c43dd8 RSI: 0000000000000286 RDI: ffff9ecf3bc207c0
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.140814] RBP: ffffb7a280c43e58 R08: 0000000000000000 R09: 0000000000000000
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.141229] R10: ffff9ece2a6c0840 R11: 0000000000000010 R12: ffff9ecf18b46b00
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.141642] R13: ffff9ece590a19c8 R14: ffff9ece044def90 R15: 0000000000000000
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.142073] FS:  0000000000000000(0000) GS:ffff9ecf3bc00000(0000) knlGS:0000000000000000
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.142511] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.142922] CR2: 000000000000021c CR3: 00000000254ba004 CR4: 00000000007706f0
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.143384] PKRU: 55555554
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.143821] tun1: deleting peer with id 21, reason 1
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.144304] tun1: deleting peer with id 19, reason 1
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.144988] tun1: deleting peer with id 18, reason 1
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.145522] tun1: deleting peer with id 17, reason 1
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.145965] tun1: deleting peer with id 16, reason 1
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.146411] tun1: deleting peer with id 15, reason 1
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.146839] tun1: deleting peer with id 13, reason 1
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.147245] tun1: deleting peer with id 12, reason 1
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.147664] tun1: deleting peer with id 11, reason 1
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.148192] tun1: deleting peer with id 0, reason 1
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.151984] tun1: deleting peer with id 10, reason 1
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.152477] tun1: deleting peer with id 9, reason 1
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.152868] tun1: deleting peer with id 7, reason 1
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.153264] tun1: deleting peer with id 6, reason 1
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.153605] tun1: deleting peer with id 5, reason 1
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.153940] tun1: deleting peer with id 4, reason 1
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.154260] tun1: deleting peer with id 2, reason 1
Apr 25 08:36:10 openvpn-gw162-int kernel: [43533.154568] tun1: deleting peer with id 1, reason 1
Apr 25 08:36:23 openvpn-gw162-int kernel: [43546.267778] unregister_netdevice: waiting for tun1 to become free. Usage count = 3
Apr 25 08:36:30 openvpn-gw162-int kernel: [43553.247650] tun0: deleting peer with id 69, reason 1
Apr 25 08:36:33 openvpn-gw162-int kernel: [43556.347620] unregister_netdevice: waiting for tun1 to become free. Usage count = 3
Apr 25 08:36:43 openvpn-gw162-int kernel: [43566.427435] unregister_netdevice: waiting for tun1 to become free. Usage count = 3
Apr 25 08:36:50 openvpn-gw162-int kernel: [43573.075332] tun0: deleting peer with id 77, reason 1
Apr 25 08:36:53 openvpn-gw162-int kernel: [43576.507298] unregister_netdevice: waiting for tun1 to become free. Usage count = 3
Apr 25 08:37:03 openvpn-gw162-int kernel: [43586.587123] unregister_netdevice: waiting for tun1 to become free. Usage count = 3
Apr 25 08:37:06 openvpn-gw162-int kernel: [43589.363069] tun0: deleting peer with id 79, reason 1
Apr 25 08:37:13 openvpn-gw162-int kernel: [43596.666984] unregister_netdevice: waiting for tun1 to become free. Usage count = 3
Apr 25 08:37:23 openvpn-gw162-int kernel: [43606.746784] unregister_netdevice: waiting for tun1 to become free. Usage count = 3
Apr 25 08:37:30 openvpn-gw162-int kernel: [43612.910705] tun0: deleting peer with id 70, reason 2
Apr 25 08:37:33 openvpn-gw162-int kernel: [43616.826641] unregister_netdevice: waiting for tun1 to become free. Usage count = 3
Apr 25 08:37:37 openvpn-gw162-int kernel: [43620.090577] tun0: deleting peer with id 82, reason 1
Apr 25 08:37:42 openvpn-gw162-int kernel: [43625.106540] tun0: deleting peer with id 84, reason 1
...

@mr-liusg

mr-liusg commented May 13, 2023

I found a similar problem on my UDP server.
But it was hard to reproduce and I ran it for 2-3 days before the problem happened.

[screenshot of log output attached]

@ordex
Member

ordex commented May 13, 2023

@mr-liusg this is not the same issue. The messages you see are about incoming traffic for a peer that is not known.
If you feel this is a bug, please open another ticket.

@mr-liusg

@mr-liusg this is not the same issue. The messages you see are about incoming traffic for a peer that is not known. If you feel this is a bug, please open another ticket.

I looked at the logs in depth today. I believe the problem is similar to #13. I've posted my error stack over there.

@claw6148

claw6148 commented Oct 17, 2023

I can reproduce this issue reliably on my machine:

  1. Run the ovpn server in tcp mode (ovpn: 2.6.6, dco: v0.2.20230426)
  2. Connect to the ovpn server, and then perform the iperf3 upload test (from client to server)
  3. Press Ctrl-C to exit the ovpn server during testing, and the problem reproduces

I added some dump_stack() and printk() calls in the DCO driver.
The dmesg logs below compare a successful and a failed exit of the ovpn server.
In the failure scenario, ovpn_peer_delete_work() is not executed.

dmesg logs

successful

[   93.195417] CPU: 2 PID: 3337 Comm: openvpn Not tainted 5.4.238 #0
[   93.201518] Stack : 8df4bc74 800804d0 80700000 807067d4 80770000 8070679c 80705498 8df4bb6c
[   93.209853]         808b0000 8fe0c2f0 80750d63 806860e4 00000002 00000001 8df4bb10 964647d0
[   93.218179]         00000000 00000000 808f0000 00000000 30232038 00000129 2e352064 33322e34
[   93.226500]         00000000 00000004 00000000 0002fb59 80000000 80770000 00000000 00000000
[   93.234821]         8f16d480 00000000 807683e0 8eabcd80 00000000 80369ec8 003d1a6f 003d93ef
[   93.243142]         ...
[   93.245577] Call Trace:
[   93.248047] [<8000b6cc>] show_stack+0x30/0x100
[   93.252512] [<805c7dcc>] dump_stack+0xa4/0xdc
[   93.256993] [<8eac9288>] ovpn_peer_release_kref+0x68/0x7c [ovpn_dco_v2]
[   93.263626] [<804aa770>] genl_rcv_msg+0x24c/0x4cc
[   93.268313] [<804a8c34>] netlink_rcv_skb+0xac/0x158
[   93.273172] [<804a94d4>] genl_rcv+0x30/0x48
[   93.277340] [<804a823c>] netlink_unicast+0x1c4/0x2e4
[   93.282285] [<804a86a8>] netlink_sendmsg+0x34c/0x4a4
[   93.287249] [<80416b28>] ____sys_sendmsg+0xc0/0x264
[   93.292109] [<80417438>] ___sys_sendmsg+0x7c/0xcc
[   93.296801] [<80418658>] __sys_sendmsg+0x4c/0x94
[   93.301411] [<80014598>] syscall_common+0x34/0x58
[   93.307392] CPU: 2 PID: 3337 Comm: openvpn Not tainted 5.4.238 #0
[   93.313483] Stack : 8df4bc74 800804d0 80700000 807067d4 80770000 8070679c 80705498 8df4bb6c
[   93.321815]         808b0000 8fe0c2f0 80750d63 806860e4 00000002 00000001 8df4bb10 964647d0
[   93.330136]         00000000 00000000 808f0000 00000000 30232038 0000013d 2e352064 33322e34
[   93.338457]         00000000 00000004 00000000 0004b0c0 80000000 80770000 00000000 00000000
[   93.346777]         8f16d900 00000000 807683e0 8eabcd80 00000000 80369ec8 003d17af 003d932f
[   93.355099]         ...
[   93.357535] Call Trace:
[   93.360006] [<8000b6cc>] show_stack+0x30/0x100
[   93.364468] [<805c7dcc>] dump_stack+0xa4/0xdc
[   93.368942] [<8eac9288>] ovpn_peer_release_kref+0x68/0x7c [ovpn_dco_v2]
[   93.375569] [<804aa770>] genl_rcv_msg+0x24c/0x4cc
[   93.380256] [<804a8c34>] netlink_rcv_skb+0xac/0x158
[   93.385116] [<804a94d4>] genl_rcv+0x30/0x48
[   93.389284] [<804a823c>] netlink_unicast+0x1c4/0x2e4
[   93.394230] [<804a86a8>] netlink_sendmsg+0x34c/0x4a4
[   93.399193] [<80416b28>] ____sys_sendmsg+0xc0/0x264
[   93.404052] [<80417438>] ___sys_sendmsg+0x7c/0xcc
[   93.408744] [<80418658>] __sys_sendmsg+0x4c/0x94
[   93.413351] [<80014598>] syscall_common+0x34/0x58
[   93.497190] CPU: 2 PID: 3337 Comm: openvpn Not tainted 5.4.238 #0
[   93.503281] Stack : 8df4bc74 800804d0 80700000 807067d4 80770000 8070679c 80705498 8df4bb6c
[   93.511609]         808b0000 8fe0c2f0 80750d63 806860e4 00000002 00000001 8df4bb10 964647d0
[   93.519932]         00000000 00000000 808f0000 00000000 30232038 00000151 2e352064 33322e34
[   93.528260]         00000000 00000004 00000000 00079626 80000000 80770000 00000000 00000000
[   93.536581]         8f16d480 00000000 807683e0 8eabcd80 00000000 80369ec8 00000008 808b0008
[   93.544905]         ...
[   93.547339] Call Trace:
[   93.549809] [<8000b6cc>] show_stack+0x30/0x100
[   93.554274] [<805c7dcc>] dump_stack+0xa4/0xdc
[   93.558746] [<8eac9288>] ovpn_peer_release_kref+0x68/0x7c [ovpn_dco_v2]
[   93.565379] [<804aa770>] genl_rcv_msg+0x24c/0x4cc
[   93.570066] [<804a8c34>] netlink_rcv_skb+0xac/0x158
[   93.574925] [<804a94d4>] genl_rcv+0x30/0x48
[   93.579093] [<804a823c>] netlink_unicast+0x1c4/0x2e4
[   93.584039] [<804a86a8>] netlink_sendmsg+0x34c/0x4a4
[   93.589002] [<80416b28>] ____sys_sendmsg+0xc0/0x264
[   93.593862] [<80417438>] ___sys_sendmsg+0x7c/0xcc
[   93.598554] [<80418658>] __sys_sendmsg+0x4c/0x94
[   93.603162] [<80014598>] syscall_common+0x34/0x58
[  111.836167] *** Quit OVPN Here ***
[  114.369607] CPU: 2 PID: 3337 Comm: openvpn Not tainted 5.4.238 #0
[  114.375703] Stack : 8e216700 800804d0 80700000 807067d4 80770000 8070679c 80705498 8df4ba84
[  114.384035]         808b0000 8fe0c2f0 80750d63 806860e4 00000002 00000001 8df4ba28 964647d0
[  114.392357]         00000000 00000000 808f0000 00000000 30232038 00000166 2e352064 33322e34
[  114.400681]         00000000 00000005 00000000 0005a3c7 80000000 80770000 00000000 8df31250
[  114.409006]         00000001 00000002 8df844b4 8df844b4 00000000 80369ec8 0803e098 808b0008
[  114.417327]         ...
[  114.419761] Call Trace:
[  114.422230] [<8000b6cc>] show_stack+0x30/0x100
[  114.426693] [<805c7dcc>] dump_stack+0xa4/0xdc
[  114.431172] [<8eac679c>] ovpn_peer_stats_init+0x183c/0x192c [ovpn_dco_v2]
[  114.438681] CPU: 2 PID: 3337 Comm: openvpn Not tainted 5.4.238 #0
[  114.444752] Stack : 8e216700 800804d0 80700000 807067d4 80770000 8070679c 80705498 8df4bbf4
[  114.453078]         808b0000 8fe0c2f0 80750d63 806860e4 00000002 00000001 8df4bb98 964647d0
[  114.461399]         00000000 00000000 808f0000 00000000 30232038 00000171 2e352064 33322e34
[  114.469720]         00000000 00000005 00000000 0006b199 80000000 80770000 00000000 8df31250
[  114.478040]         00000000 00000002 8df844b4 8df844b4 00000000 80369ec8 003d17af 003d932f
[  114.486361]         ...
[  114.488795] Call Trace:
[  114.491257] [<8000b6cc>] show_stack+0x30/0x100
[  114.495711] [<805c7dcc>] dump_stack+0xa4/0xdc
[  114.500151] [<8eac679c>] ovpn_peer_stats_init+0x183c/0x192c [ovpn_dco_v2]
[  114.508730] CPU: 2 PID: 3337 Comm: openvpn Not tainted 5.4.238 #0
[  114.514822] Stack : 8df4bc74 800804d0 80700000 807067d4 80770000 8070679c 80705498 8df4bb2c
[  114.523155]         808b0000 8fe0c2f0 80750d63 806860e4 00000002 00000001 8df4bad0 964647d0
[  114.531476]         00000000 00000000 808f0000 00000000 30232038 0000017c 2e352064 33322e34
[  114.539797]         00000000 00000005 00000000 0007c33a 80000000 80770000 00000000 00000000
[  114.548117]         8f16df00 00000000 807683e0 8eabcd80 00000000 80369ec8 003d17af 003d932f
[  114.556438]         ...
[  114.558874] Call Trace:
[  114.561346] [<8000b6cc>] show_stack+0x30/0x100
[  114.565809] [<805c7dcc>] dump_stack+0xa4/0xdc
[  114.570277] [<8eac9234>] ovpn_peer_release_kref+0x14/0x7c [ovpn_dco_v2]
[  114.576916] [<8eac61b8>] ovpn_peer_stats_init+0x1258/0x192c [ovpn_dco_v2]
[  114.583954] CPU: 2 PID: 3337 Comm: openvpn Not tainted 5.4.238 #0
[  114.590021] Stack : 8df4bc74 800804d0 80700000 807067d4 80770000 8070679c 80705498 8df4bb6c
[  114.598347]         808b0000 8fe0c2f0 80750d63 806860e4 00000002 00000001 8df4bb10 964647d0
[  114.606668]         00000000 00000000 808f0000 00000000 30232038 00000188 2e352064 33322e34
[  114.614990]         00000000 00000005 00000000 0008e912 80000000 80770000 00000000 00000000
[  114.623312]         8f16df00 00000000 807683e0 8eabcd80 00000000 80369ec8 003d172f 003d176f
[  114.631634]         ...
[  114.634068] Call Trace:
[  114.636517] [<8000b6cc>] show_stack+0x30/0x100
[  114.640960] [<805c7dcc>] dump_stack+0xa4/0xdc
[  114.645356] [<8eac9288>] ovpn_peer_release_kref+0x68/0x7c [ovpn_dco_v2]
[  114.651988] [<804aa770>] genl_rcv_msg+0x24c/0x4cc
[  114.656676] [<804a8c34>] netlink_rcv_skb+0xac/0x158
[  114.661535] [<804a94d4>] genl_rcv+0x30/0x48
[  114.665704] [<804a823c>] netlink_unicast+0x1c4/0x2e4
[  114.670651] [<804a86a8>] netlink_sendmsg+0x34c/0x4a4
[  114.675614] [<80416b28>] ____sys_sendmsg+0xc0/0x264
[  114.680474] [<80417438>] ___sys_sendmsg+0x7c/0xcc
[  114.685166] [<80418658>] __sys_sendmsg+0x4c/0x94
[  114.689776] [<80014598>] syscall_common+0x34/0x58
[  114.695137] ovpn_peer_delete_work() enter
[  114.699223] ovpn_tcp_socket_detach() enter
[  114.703574] ovpn_tcp_socket_detach() leave
[  114.707700] ovpn_peer_delete_work() mid
[  114.711556] ovs: deleting peer with id 0, reason 1
[  114.716396] ovpn_peer_delete_work() leave
[  114.732990] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.4.238 #0
[  114.738981] Stack : 807a42b8 800804d0 80700000 807067d4 80770000 8070679c 80705498 8fc11e84
[  114.747308]         808b0000 8fc3c8b0 80750d63 806860e4 00000002 00000001 8fc11e28 964647d0
[  114.755629]         00000000 00000000 808f0000 00000000 00000030 000001a2 342e3520 3833322e
[  114.763949]         00000000 00000005 00000000 000b2f3e 00000000 80770000 00000000 80756ac0
[  114.772269]         8fc11f20 00000000 00000000 0000000a 00000003 80369ec8 003d1a6f 003d93ef
[  114.780590]         ...
[  114.783025] Call Trace:
[  114.785496] [<8000b6cc>] show_stack+0x30/0x100
[  114.789960] [<805c7dcc>] dump_stack+0xa4/0xdc
[  114.794439] [<8eac27d8>] ovpn_send_data+0x43c/0x7a0 [ovpn_dco_v2]

failure

[  162.589850] CPU: 0 PID: 3337 Comm: openvpn Not tainted 5.4.238 #0
[  162.595949] Stack : 8df4bc74 800804d0 80700000 807067d4 80770000 8070679c 80705498 8df4bb6c
[  162.604279]         808b0000 8fe0c2f0 80750d63 806860e4 00000000 00000001 8df4bb10 964647d0
[  162.612599]         00000000 00000000 808f0000 00000000 30232038 00000230 2e352064 33322e34
[  162.620918]         00000000 00000007 00000000 0009001a 80000000 80770000 00000000 00000000
[  162.629237]         8ff79540 00000000 807683e0 8eabcd80 00000000 80369ec8 003d1a6f 003d93ef
[  162.637557]         ...
[  162.639990] Call Trace:
[  162.642460] [<8000b6cc>] show_stack+0x30/0x100
[  162.646920] [<805c7dcc>] dump_stack+0xa4/0xdc
[  162.651398] [<8eac9288>] ovpn_peer_release_kref+0x68/0x7c [ovpn_dco_v2]
[  162.658025] [<804aa770>] genl_rcv_msg+0x24c/0x4cc
[  162.662709] [<804a8c34>] netlink_rcv_skb+0xac/0x158
[  162.667564] [<804a94d4>] genl_rcv+0x30/0x48
[  162.671730] [<804a823c>] netlink_unicast+0x1c4/0x2e4
[  162.676672] [<804a86a8>] netlink_sendmsg+0x34c/0x4a4
[  162.681633] [<80416b28>] ____sys_sendmsg+0xc0/0x264
[  162.686489] [<80417438>] ___sys_sendmsg+0x7c/0xcc
[  162.691177] [<80418658>] __sys_sendmsg+0x4c/0x94
[  162.695782] [<80014598>] syscall_common+0x34/0x58
[  162.701889] CPU: 0 PID: 3337 Comm: openvpn Not tainted 5.4.238 #0
[  162.707979] Stack : 8df4bc74 800804d0 80700000 807067d4 80770000 8070679c 80705498 8df4bb6c
[  162.716314]         808b0000 8fe0c2f0 80750d63 806860e4 00000000 00000001 8df4bb10 964647d0
[  162.724634]         00000000 00000000 808f0000 00000000 30232038 00000244 2e352064 33322e34
[  162.732953]         00000000 00000007 00000000 000ab5c1 80000000 80770000 00000000 00000000
[  162.741272]         8ff79480 00000000 807683e0 8eabcd80 00000000 80369ec8 003d11af 003d91af
[  162.749591]         ...
[  162.752026] Call Trace:
[  162.754492] [<8000b6cc>] show_stack+0x30/0x100
[  162.758952] [<805c7dcc>] dump_stack+0xa4/0xdc
[  162.763422] [<8eac9288>] ovpn_peer_release_kref+0x68/0x7c [ovpn_dco_v2]
[  162.770047] [<804aa770>] genl_rcv_msg+0x24c/0x4cc
[  162.774731] [<804a8c34>] netlink_rcv_skb+0xac/0x158
[  162.779587] [<804a94d4>] genl_rcv+0x30/0x48
[  162.783753] [<804a823c>] netlink_unicast+0x1c4/0x2e4
[  162.788696] [<804a86a8>] netlink_sendmsg+0x34c/0x4a4
[  162.793656] [<80416b28>] ____sys_sendmsg+0xc0/0x264
[  162.798512] [<80417438>] ___sys_sendmsg+0x7c/0xcc
[  162.803200] [<80418658>] __sys_sendmsg+0x4c/0x94
[  162.807805] [<80014598>] syscall_common+0x34/0x58
[  162.815836] CPU: 0 PID: 3337 Comm: openvpn Not tainted 5.4.238 #0
[  162.821929] Stack : 8df4bc74 800804d0 80700000 807067d4 80770000 8070679c 80705498 8df4bb6c
[  162.830260]         808b0000 8fe0c2f0 80750d63 806860e4 00000000 00000001 8df4bb10 964647d0
[  162.838580]         00000000 00000000 808f0000 00000000 30232038 00000258 2e352064 33322e34
[  162.846900]         00000000 00000007 00000000 000c72dc 80000000 80770000 00000000 00000000
[  162.855219]         8ff79900 00000000 807683e0 8eabcd80 00000000 80369ec8 003d11af 003d91af
[  162.863539]         ...
[  162.865972] Call Trace:
[  162.868439] [<8000b6cc>] show_stack+0x30/0x100
[  162.872899] [<805c7dcc>] dump_stack+0xa4/0xdc
[  162.877372] [<8eac9288>] ovpn_peer_release_kref+0x68/0x7c [ovpn_dco_v2]
[  162.883997] [<804aa770>] genl_rcv_msg+0x24c/0x4cc
[  162.888680] [<804a8c34>] netlink_rcv_skb+0xac/0x158
[  162.893537] [<804a94d4>] genl_rcv+0x30/0x48
[  162.897703] [<804a823c>] netlink_unicast+0x1c4/0x2e4
[  162.902645] [<804a86a8>] netlink_sendmsg+0x34c/0x4a4
[  162.907606] [<80416b28>] ____sys_sendmsg+0xc0/0x264
[  162.912462] [<80417438>] ___sys_sendmsg+0x7c/0xcc
[  162.917151] [<80418658>] __sys_sendmsg+0x4c/0x94
[  162.921757] [<80014598>] syscall_common+0x34/0x58
[  164.893558] *** Quit OVPN Here ***
[  177.440740] CPU: 0 PID: 3337 Comm: openvpn Not tainted 5.4.238 #0
[  177.446838] Stack : 8de35300 800804d0 80700000 807067d4 80770000 8070679c 80705498 8df4ba84
[  177.455186]         808b0000 8fe0c2f0 80750d63 806860e4 00000000 00000001 8df4ba28 964647d0
[  177.463539]         00000000 00000000 808f0000 00000000 30232038 0000026d 2e352064 33322e34
[  177.471893]         00000000 00000008 00000000 0006b9a4 80000000 80770000 00000000 8df31250
[  177.480239]         00000001 00000002 8df844b4 8df844b4 00000000 80369ec8 08042098 808b0000
[  177.488586]         ...
[  177.491037] Call Trace:
[  177.493533] [<8000b6cc>] show_stack+0x30/0x100
[  177.498030] [<805c7dcc>] dump_stack+0xa4/0xdc
[  177.502573] [<8eac679c>] ovpn_peer_stats_init+0x183c/0x192c [ovpn_dco_v2]
[  177.512047] CPU: 3 PID: 3337 Comm: openvpn Not tainted 5.4.238 #0
[  177.518143] Stack : 8e962800 800804d0 80700000 807067d4 80770000 8070679c 80705498 8df4bbf4
[  177.526503]         808b0000 8fe0c2f0 80750d63 806860e4 00000003 00000001 8df4bb98 964647d0
[  177.534868]         00000000 00000000 808f0000 00000000 30232038 00000278 2e352064 33322e34
[  177.543234]         00000000 00000008 00000000 0007d02f 80000000 80770000 00000000 8df31250
[  177.551613]         00000000 00000002 8df844b4 8df844b4 00000000 80369ec8 0803e098 808b000c
[  177.559996]         ...
[  177.562463] Call Trace:
[  177.564962] [<8000b6cc>] show_stack+0x30/0x100
[  177.569489] [<805c7dcc>] dump_stack+0xa4/0xdc
[  177.574039] [<8eac679c>] ovpn_peer_stats_init+0x183c/0x192c [ovpn_dco_v2]
[  177.585732] CPU: 3 PID: 3337 Comm: openvpn Not tainted 5.4.238 #0
[  177.591837] Stack : 8df4bc74 800804d0 80700000 807067d4 80770000 8070679c 80705498 8df4bb6c
[  177.600208]         808b0000 8fe0c2f0 80750d63 806860e4 00000003 00000001 8df4bb10 964647d0
[  177.608572]         00000000 00000000 808f0000 00000000 30232038 00000283 2e352064 33322e34
[  177.616927]         00000000 00000008 00000000 0008f004 80000000 80770000 00000000 00000000
[  177.625277]         8feaf240 00000000 807683e0 8eabcd80 00000000 80369ec8 003d1a6f 003d93ef
[  177.633628]         ...
[  177.636085] Call Trace:
[  177.638573] [<8000b6cc>] show_stack+0x30/0x100
[  177.643058] [<805c7dcc>] dump_stack+0xa4/0xdc
[  177.647586] [<8eac9288>] ovpn_peer_release_kref+0x68/0x7c [ovpn_dco_v2]
[  177.654244] [<804aa770>] genl_rcv_msg+0x24c/0x4cc
[  177.658932] [<804a8c34>] netlink_rcv_skb+0xac/0x158
[  177.663794] [<804a94d4>] genl_rcv+0x30/0x48
[  177.667966] [<804a823c>] netlink_unicast+0x1c4/0x2e4
[  177.672911] [<804a86a8>] netlink_sendmsg+0x34c/0x4a4
[  177.677880] [<80416b28>] ____sys_sendmsg+0xc0/0x264
[  177.682741] [<80417438>] ___sys_sendmsg+0x7c/0xcc
[  177.687436] [<80418658>] __sys_sendmsg+0x4c/0x94
[  177.692051] [<80014598>] syscall_common+0x34/0x58
[  187.815984] unregister_netdevice: waiting for ovs to become free. Usage count = 1
[  197.896299] unregister_netdevice: waiting for ovs to become free. Usage count = 1
[  207.976583] unregister_netdevice: waiting for ovs to become free. Usage count = 1
[  218.057089] unregister_netdevice: waiting for ovs to become free. Usage count = 1

@cron2
Contributor

cron2 commented Oct 17, 2023

Oh, this is exciting news. Especially having instructions on how to reproduce this is really, really good.

(My testing so far focused on heavy client connect/disconnect churn, which did trigger a few issues :-) - but yours seems to be a race condition between "(heavy) traffic inside the VPN" and "peer being torn down", which I did not specifically test yet)

@claw6148

Oh, this is exciting news. Especially having instructions on how to reproduce this is really, really good.

(My testing so far focused on heavy client connect/disconnect churn, which did trigger a few issues :-) - but yours seems to be a race condition between "(heavy) traffic inside the VPN" and "peer being torn down", which I did not specifically test yet)

Yes, just one client is enough to reproduce.
This is less likely to happen if I perform an iperf3 download test (from server to client).

@ordex
Member

ordex commented Oct 17, 2023

Super exciting! Will give it a try today!

@claw6148

Is it related to peer->refcount?
I added some output to ovpn_peer_hold() and ovpn_peer_put(), as shown below:

diff --git a/drivers/net/ovpn-dco/peer.h b/drivers/net/ovpn-dco/peer.h
index c4f72f6..a439e14 100644
--- a/drivers/net/ovpn-dco/peer.h
+++ b/drivers/net/ovpn-dco/peer.h
@@ -122,12 +122,16 @@ void ovpn_peer_release(struct ovpn_peer *peer);
 
 static inline bool ovpn_peer_hold(struct ovpn_peer *peer)
 {
+	printk("ovpn_peer_hold peer->refcount = %d\n", peer->refcount.refcount.refs.counter);
+	dump_stack();
 	return kref_get_unless_zero(&peer->refcount);
 }
 
 static inline void ovpn_peer_put(struct ovpn_peer *peer)
 {
 	kref_put(&peer->refcount, ovpn_peer_release_kref);
+	printk("ovpn_peer_put peer->refcount = %d\n", peer->refcount.refcount.refs.counter);
+	dump_stack();
 }
 
 static inline void ovpn_peer_keepalive_recv_reset(struct ovpn_peer *peer)

The attached refcount.log is the dmesg log.
At lines 5713 and 5725, ovpn_peer_hold() shows peer->refcount as 0 twice, but there is no call to ovpn_peer_put() between the two ovpn_peer_hold() calls.
At line 5739: refcount_t: underflow; use-after-free.

@ordex
Member

ordex commented Oct 17, 2023

Most likely, yes. There must be some non-trivial imbalance that is triggered by a race condition...

@ordex
Member

ordex commented Oct 17, 2023

At lines 5713 and 5725, ovpn_peer_hold() shows peer->refcount as 0 twice, but there is no call to ovpn_peer_put() between the two ovpn_peer_hold() calls.

This is not an issue per se: when _hold() is called and the refcount is 0, the reference is simply not obtained, because the peer is considered gone. This is why there is no put. The refcount in this case stays at zero because the object has reached its end of life.
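For reference, a sketch of what the _hold()/_put() pair boils down to, following include/linux/kref.h (slightly simplified, comments added):

static inline int kref_get_unless_zero(struct kref *kref)
{
	/* Fails (returns 0) once the count has hit zero: the object is
	 * already dying, no reference is taken, and no put is owed. */
	return refcount_inc_not_zero(&kref->refcount);
}

static inline int kref_put(struct kref *kref,
			   void (*release)(struct kref *kref))
{
	/* Drops a reference and frees the object on the last put. */
	if (refcount_dec_and_test(&kref->refcount)) {
		release(kref);
		return 1;
	}
	return 0;
}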

@ordex
Member

ordex commented Oct 17, 2023

Mhh... we have an ovpn_peer_hold() in ovpn_tcp_read_sock() where we don't check the return value.
This means that if the refcount was already 0 at this point, it will stay 0 (per the explanation above), but the following put() will trigger an underflow.

Which is exactly what happens later in your log at line 5739, and is indeed confirmed by line 5758:

5758 [ 107.155298] ovpn_peer_put peer->refcount = -1073741824

Now the thing is: we have that _hold() without a return-value check because we assume that at that point in time we must already be holding a ref to that peer; we should not be there if we don't hold a reference.
So this points to some imbalance in the way the TCP sock object holds a reference to the related peer...
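In other words, the failure mode is the classic unchecked-hold pattern. A hypothetical minimal illustration (process_packet() is a stand-in, not the actual ovpn-dco call chain):

/* buggy: the result of the hold is ignored */
ovpn_peer_hold(peer);		/* returns false if refcount == 0 */
process_packet(peer, skb);	/* may now run against a dying peer */
ovpn_peer_put(peer);		/* 0 -> -1: "refcount_t: underflow" */

/* defensive variant: bail out if the peer is already gone */
if (unlikely(!ovpn_peer_hold(peer))) {
	kfree_skb(skb);
	return;
}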

@claw6148

claw6148 commented Oct 17, 2023

The platform I'm using above is a router with an MT7621A CPU. I just tried testing the same way on an x64 VM, but haven't reproduced the issue yet... :-(

@ordex
Member

ordex commented Oct 17, 2023

Still, the hints you provided kinda point us in a more convoluted direction. If I am unable to reproduce it as well, I may set up an old MT7621-based router and give it a shot.

@ordex
Member

ordex commented Oct 18, 2023

Interestingly enough, on my virtual setup I can reproduce the "Waiting for tunX to become free" issue by following your steps. But I don't see the underflow.
They may just be two symptoms of the same bug. It may just depend on timing.

@claw6148

claw6148 commented Oct 18, 2023

I tried moving ovpn_peer_hold() from ovpn_tcp_read_sock() to ovpn_decrypt_work(), and it seemed to be OK:

diff --git a/drivers/net/ovpn-dco/ovpn.c b/drivers/net/ovpn-dco/ovpn.c
index 772a34b..4a482f7 100644
--- a/drivers/net/ovpn-dco/ovpn.c
+++ b/drivers/net/ovpn-dco/ovpn.c
@@ -163,8 +163,7 @@ int ovpn_recv(struct ovpn_struct *ovpn, struct ovpn_peer *peer, struct sk_buff *
 	if (unlikely(ptr_ring_produce_bh(&peer->rx_ring, skb) < 0))
 		return -ENOSPC;
 
-	if (!queue_work(ovpn->crypto_wq, &peer->decrypt_work))
-		ovpn_peer_put(peer);
+	queue_work(ovpn->crypto_wq, &peer->decrypt_work);
 
 	return 0;
 }
@@ -263,6 +262,10 @@ void ovpn_decrypt_work(struct work_struct *work)
 	struct sk_buff *skb;
 
 	peer = container_of(work, struct ovpn_peer, decrypt_work);
+
+	if (!ovpn_peer_hold(peer))
+		return;
+
 	while ((skb = ptr_ring_consume_bh(&peer->rx_ring))) {
 		if (likely(ovpn_decrypt_one(peer, skb) == 0)) {
 			/* if a packet has been enqueued for NAPI, signal
diff --git a/drivers/net/ovpn-dco/tcp.c b/drivers/net/ovpn-dco/tcp.c
index 288a691..c794d1f 100644
--- a/drivers/net/ovpn-dco/tcp.c
+++ b/drivers/net/ovpn-dco/tcp.c
@@ -103,8 +103,7 @@ static int ovpn_tcp_read_sock(read_descriptor_t *desc, struct sk_buff *in_skb,
 			 * Queue skb for sending to userspace via recvmsg on the socket
 			 */
 			if (likely(ovpn_opcode_from_skb(peer->tcp.skb, 0) == OVPN_DATA_V2)) {
-				/* hold reference to peer as requird by ovpn_recv() */
-				ovpn_peer_hold(peer);
+				/* peer reference will be held in ovpn_decrypt_work() */
 				status = ovpn_recv(peer->ovpn, peer, peer->tcp.skb);
 			} else {
 				/* prepend skb with packet len. this way userspace can parse
diff --git a/drivers/net/ovpn-dco/udp.c b/drivers/net/ovpn-dco/udp.c
index db2b9e5..941dcf7 100644
--- a/drivers/net/ovpn-dco/udp.c
+++ b/drivers/net/ovpn-dco/udp.c
@@ -117,6 +117,8 @@ static int ovpn_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
 		goto drop;
 	}
 
+	ovpn_peer_put(peer);
+
 	/* should this be a non DATA_V2 packet, ret will be >0 and this will instruct the UDP
 	 * stack to continue processing this packet as usual (i.e. deliver to user space)
 	 */

Apply the following patch and set a shorter keepalive in the config to see how the peer->refcount changes.

diff --git a/drivers/net/ovpn-dco/peer.c b/drivers/net/ovpn-dco/peer.c
index f9e72fb..bde3d83 100644
--- a/drivers/net/ovpn-dco/peer.c
+++ b/drivers/net/ovpn-dco/peer.c
@@ -24,7 +24,7 @@ static void ovpn_peer_ping(struct timer_list *t)
 {
 	struct ovpn_peer *peer = from_timer(peer, t, keepalive_xmit);
 
-	netdev_dbg(peer->ovpn->dev, "%s: sending ping to peer %u\n", __func__, peer->id);
+	netdev_dbg(peer->ovpn->dev, "%s: sending ping to peer %u refcount %d\n", __func__, peer->id, peer->refcount.refcount.refs.counter);
 	ovpn_keepalive_xmit(peer);
 }
 

@ordex
Member

ordex commented Oct 18, 2023

This is racy: what guarantee do we have that, between scheduling ovpn_decrypt_work() and executing it, the peer isn't released, leaving us with a stale peer/timer pointer?

@ordex
Member

ordex commented Oct 18, 2023

The statement above is exactly the reason why the reference is obtained before scheduling the worker: we want to be sure that the peer stays alive the whole time.
Although the problem is clearly around those lines, this solution is not acceptable.
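For the record, the pattern being defended here (take the reference before scheduling, let the worker release it) looks roughly like this sketch, reconstructed from the hunks quoted above (error handling elided, not the verbatim driver code):

int ovpn_recv(struct ovpn_struct *ovpn, struct ovpn_peer *peer,
	      struct sk_buff *skb)
{
	/* caller already holds a ref on our behalf (see tcp.c hunk) */
	if (unlikely(ptr_ring_produce_bh(&peer->rx_ring, skb) < 0))
		return -ENOSPC;

	/* hand the caller's ref to the worker; if the work was already
	 * pending it owns a ref, so drop the extra one */
	if (!queue_work(ovpn->crypto_wq, &peer->decrypt_work))
		ovpn_peer_put(peer);

	return 0;
}

void ovpn_decrypt_work(struct work_struct *work)
{
	struct ovpn_peer *peer = container_of(work, struct ovpn_peer,
					      decrypt_work);

	/* ... drain peer->rx_ring ... */
	ovpn_peer_put(peer);	/* release the ref taken at schedule time */
}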

@claw6148

This is indeed a problem.

If I only modify ovpn_peer_ping() to print the refcount, it gets very large (>1000 or even more) when the problem occurs.
When the iperf3 test is stopped, the refcount does not decrease and remains at a high level.

@ordex
Member

ordex commented Oct 18, 2023

Thanks for all these hints, they are very helpful!

@ordex
Member

ordex commented Nov 2, 2023

The master branch contains what we believe to be a fix for this issue. @claw6148 would you be able to give it a try?

@claw6148

claw6148 commented Nov 2, 2023

Wow, tested several times and no hangs yet.

It still prints refcount_t: underflow; use-after-free, but it rarely happens and doesn't prevent ovpn from exiting (#29).

@ordex
Member

ordex commented Nov 2, 2023

I think that's the problem reported in #29? Would you mind appending your log there? Thanks!

@ordex
Member

ordex commented Nov 2, 2023

So I'd say this issue can be closed. Thank you all for the feedback provided so far!
Feel free to reopen it should you be able to trigger this issue again.
