Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server fails with glib2 2.76.0 #146

Closed
eworm-de opened this issue Mar 15, 2023 · 22 comments
Closed

server fails with glib2 2.76.0 #146

eworm-de opened this issue Mar 15, 2023 · 22 comments

Comments

@eworm-de
Copy link
Contributor

Have been wondering why my clients fail to boot via PXE, where NBD hangs... After some debugging it turns out the client is fine, but server behaves bad due to an update of glib2 to version 2.76.0.

No idea what change caused this... I will try to dig deeper.

@eworm-de
Copy link
Contributor Author

Starting nbd-server with -d makes the issue go away...

@eworm-de
Copy link
Contributor Author

eworm-de commented Mar 17, 2023

Bisected and found this to be the commit causing the trouble:
https://gitlab.gnome.org/GNOME/glib/-/commit/4d2e77a5546cccc10f83d103d6f1527a5c9bcf13

Looks like threading is broken here...

@yoe
Copy link
Member

yoe commented Mar 18, 2023

Sigh.

That linked page says:

As such, get rid of all the complicated code that tried to do this better

That's not very promising...

@eworm-de
Copy link
Contributor Author

I think the complicated code was referencing what was removed in glib, not what we have in nbd. 😜

I am no very familiar with that code, so no idea how to proceed from here. Let me know if I can help to further narrow this.

Sadly the glib2 package was moved to Arch's official repositories before this issue was noticed. So currently the nbd server is non-functional without extra action.

@yoe
Copy link
Member

yoe commented Mar 25, 2023

I think the complicated code was referencing what was removed in glib, not what we have in nbd.

Yes, that was clear :-)

I am no very familiar with that code, so no idea how to proceed from here. Let me know if I can help to further narrow this.

One thing you could do is run the "drd" tool of valgrind:

valgrind --tool=drd nbd-server ...

and post the output. Perhaps there's a bug in my code, but I'm not sure; It feels like a bug in glib to me, not in nbd.

Would you be willing to file a bug report against glib? Feel free to refer to this one -- and please also post the link to it in this bug report. I'll follow up there if necessary.

Sadly the glib2 package was moved to Arch's official repositories before this issue was noticed. So currently the nbd server is non-functional without extra action.

Yeah, not great :-(

@eworm-de
Copy link
Contributor Author

One thing you could do is run the "drd" tool of valgrind:

valgrind --tool=drd nbd-server ...

and post the output. Perhaps there's a bug in my code, but I'm not sure; It feels like a bug in glib to me, not in nbd.

Ran the command as given. As nbd forks to background the command finish, with this output:

==31341== drd, a thread error detector                                                                                                                                                     
==31341== Copyright (C) 2006-2020, and GNU GPL'd, by Bart Van Assche.                                                                                                                      
==31341== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info                                                                                                               
==31341== Command: nbd-server                                                                                                                                                              
==31341==                                                                                                                                                                                  
==31341==                                                                                                                                                                                  
==31341== For lists of detected and suppressed errors, rerun with: -s                                                                                                                      
==31341== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)                                                                                                                   
--31392-- WARNING: unhandled amd64-linux syscall: 441                                                                                                                        
--31392-- You may be able to write your own handler.                                                                                                                                       
--31392-- Read the file README_MISSING_SYSCALL_OR_IOCTL.                                                                                                                                   
--31392-- Nevertheless we consider this a bug.  Please report                                                                                                                              
--31392-- it at http://valgrind.org/support/bug_reports.html.

The forked process was still running in background, so I connected to nbd and saw the bad behavior. Then kill the processes:

==31563==                                                                                    
==31563== Process terminating with default action of signal 15 (SIGTERM)
==31563==    at 0x4D060DB: syscall (syscall.S:37)
==31563==    by 0x496D7B4: g_cond_wait (gthread-posix.c:1475)
==31563==    by 0x49736B3: g_thread_pool_start_thread.constprop.0 (gthreadpool.c:481)
==31563==    by 0x4949682: g_thread_pool_push (gthreadpool.c:711)
==31563==    by 0x1140A9: mainloop_threaded.isra.0 (nbd-server.c:2925)
==31563==    by 0x10D4DC: UnknownInlinedFun (nbd-server.c:3062)
==31563==    by 0x10D4DC: UnknownInlinedFun (nbd-server.c:3295)
==31563==    by 0x10D4DC: main (nbd-server.c:3740)
==31563== 
==31563== For lists of detected and suppressed errors, rerun with: -s
==31563== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 14 from 6)
==31392== The object at address 0x4fcd020 is not a mutex.
==31392==    at 0x4853DF7: pthread_mutex_destroy_intercept (drd_pthread_intercepts.c:912)
==31392==    by 0x4853DF7: pthread_mutex_destroy@* (drd_pthread_intercepts.c:922)
==31392==    by 0x4ECD0DC: ??? (in /usr/lib/libp11-kit.so.0.3.0)
==31392==    by 0x40010E1: _dl_call_fini (dl-call_fini.c:43)
==31392==    by 0x4004F4D: _dl_fini (dl-fini.c:114)
==31392==    by 0x4C41444: __run_exit_handlers (exit.c:108)
==31392==    by 0x4C415AF: exit (exit.c:138)
==31392==    by 0x10C51B: UnknownInlinedFun (nbd-server.c:3227)
==31392==    by 0x10C51B: main (nbd-server.c:3740)
==31392== 
==31392== The object at address 0x4fcd060 is not a mutex.
==31392==    at 0x4853DF7: pthread_mutex_destroy_intercept (drd_pthread_intercepts.c:912)
==31392==    by 0x4853DF7: pthread_mutex_destroy@* (drd_pthread_intercepts.c:922)
==31392==    by 0x4ECD0E9: ??? (in /usr/lib/libp11-kit.so.0.3.0)
==31392==    by 0x40010E1: _dl_call_fini (dl-call_fini.c:43)
==31392==    by 0x4004F4D: _dl_fini (dl-fini.c:114)
==31392==    by 0x4C41444: __run_exit_handlers (exit.c:108)
==31392==    by 0x4C415AF: exit (exit.c:138)
==31392==    by 0x10C51B: UnknownInlinedFun (nbd-server.c:3227)
==31392==    by 0x10C51B: main (nbd-server.c:3740)
==31392== 
==31392== 
==31392== For lists of detected and suppressed errors, rerun with: -s
==31392== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 14 from 6)

So looks like valgrind finds errors, but no idea what to blame.

Would you be willing to file a bug report against glib? Feel free to refer to this one -- and please also post the link to it in this bug report. I'll follow up there if necessary.

Will do next week with some spare time if we do not solve it here.

@eworm-de
Copy link
Contributor Author

Building with --without-gnutls gives the same behavior, but no errors from valgrind. 🤔

@eworm-de
Copy link
Contributor Author

Just opened an issue for glib...

@Rijndael1998
Copy link

Rijndael1998 commented Mar 29, 2023

My issue is similar, but not quite the same. My server starts (I think) and I can connect from a client. The drives show up with their sizes but when reading (cat /dev/nbd0 or dd if=/dev/nbd0), the read hangs indefinitely. I'm not going to try writing to drives for obvious reasons.

I'm not sure what could be causing this and I'm not sure how I can troubleshoot, but I'll be happy to help if I can.

I do know that glib version is 2.76.0-1 on Arch Linux.

@eworm-de
Copy link
Contributor Author

Sounds exactly the same.

@Rijndael1998
Copy link

Downgrading to glib 2.74.6-1 on Arch Linux fixes this server issue...

@eworm-de
Copy link
Contributor Author

We have a reaction from glib upstream asking for more information... But I am not that familiar with the nbd code.
@yoe, can you comment the discussion and add some insight?

@gavinhungry
Copy link

gavinhungry commented Apr 4, 2023

Also seeing this with glib2 2.76.1 on Arch Linux.

As an imperfect alternative to downgrading glib2, you can also override the nbd.service unit (if using systemd) to disable forking with -d, as pointed out by @eworm-de in #146 (comment):

$ sudo systemctl edit nbd.service
# /etc/systemd/system/nbd.service.d/override.conf

[Service]
Type=simple
ExecStart=
ExecStart=/usr/bin/nbd-server -d
Restart=always
$ sudo systemctl daemon-reload
$ sudo systemctl restart nbd.service

@hmmsjan
Copy link

hmmsjan commented Apr 21, 2023

Here on the new Fedora 38 the same problem.
I tried some debugging. It looks like a thread pool does not survive a fork, even in a small test program. After moving "g_thread_pool_new" into "spawn_child", the server works, but I do not know whether, apart from undefined genconf.threads, new problems will be introduced

The test program printing some output via g_thread_pool_push runs in Debian 11, but freezes in Fedora 38 if the pool is created before the fork.

Digging a bit further with the example program:
Just behind g_thread_pool_new, a second thread is created:

Id Target Id Frame

  • 1 Thread 0x7ffff7ca6740 (LWP 5841) "ex1" main () at ex1.cpp:26
    2 Thread 0x7ffff78206c0 (LWP 5844) "pool-spawner" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38

If I understand correctly, threads are not duplicated during fork, so probably having a g_thread_pool_new before and g_thread_pool_push after a fork is a recipe for waiting forever. I assume some rearrangements in nbd-server are required....

@yoe
Copy link
Member

yoe commented Apr 23, 2023

Okay, thanks for that.

I don't think we do anything important with the thread pool before the fork, so moving it to be called after the fork should be doable. I'll look at that and publish a new version ASAP.

@hmmsjan
Copy link

hmmsjan commented Apr 23, 2023

I made a littile patch for Fedora, leaving the pool_new on the same place if nofork and doing the pool_new after the fork if !nofork. Plus extra parameter to get the number of threads in that function. It compiled and performed all tests.
And I could play a DVD on nbd0 and swap on nbd1, so this looks promising.

@eworm-de
Copy link
Contributor Author

Can you share your patch and/or open a pull request?

@yoe yoe closed this as completed in 5dba772 Apr 27, 2023
yoe added a commit that referenced this issue Apr 27, 2023
We created the thread pool at the main initialization, before forking
off a child. This used to work just fine, but as of GLib 2.76, this no
longer works due to changes internal to the implementation of
GThreadPool.

Since we don't need to use the thread pool before the fork() call
anyway, stop trying to do so and avoid the problem altogether.

Closes: gh-146
Signed-off-by: Wouter Verhelst <w@uter.be>
@eworm-de
Copy link
Contributor Author

Thanks a lot!

archlinux-github pushed a commit to archlinux/svntogit-community that referenced this issue Apr 27, 2023
NetworkBlockDevice/nbd#146

git-svn-id: file:///srv/repos/svn-community/svn@1448938 9fca08f4-af9d-4005-b8df-a31f2cc04f65
archlinux-github pushed a commit to archlinux/svntogit-community that referenced this issue Apr 27, 2023
NetworkBlockDevice/nbd#146


git-svn-id: file:///srv/repos/svn-community/svn@1448938 9fca08f4-af9d-4005-b8df-a31f2cc04f65
@hmmsjan
Copy link

hmmsjan commented Apr 27, 2023

Sorry for the delay. I see the problem has been solved in meantime. This latest version has some more differences with the Fedora provided version 3.24. Based on the Fedora version 3.24, only move of the g_thread_pool_new call to the location as in the latest version breaks the inetd test. The patch below passes all tests during build of Fedora 38 rpm.

--- nbd-3.24/nbd-server.c.orig  2023-04-27 22:53:38.526801760 +0200
+++ nbd-3.24/nbd-server.c       2023-04-27 22:52:22.954792077 +0200
@@ -3018,8 +3018,8 @@
                         return;
                 }
                 /* Child just continues. */
+               tpool = g_thread_pool_new(handle_request, NULL, genconf->threads, FALSE, NULL);
         }
-
         sock_flags_old = fcntl(net, F_GETFL, 0);
         if (sock_flags_old == -1) {
                 msg(LOG_ERR, "Failed to get socket flags");
@@ -3704,7 +3704,8 @@
 #if HAVE_OLD_GLIB
        g_thread_init(NULL);
 #endif
-       tpool = g_thread_pool_new(handle_request, NULL, genconf.threads, FALSE, NULL);
+       if(dontfork)
+               tpool = g_thread_pool_new(handle_request, NULL, genconf.threads, FALSE, NULL);

        setup_servers(servers, genconf.modernaddr, genconf.modernport,
                        genconf.unixsock, genconf.flags);


@cheese
Copy link

cheese commented Apr 28, 2023

Any chance we have a new release?

Yarny0 added a commit to Yarny0/nixpkgs that referenced this issue Apr 30, 2023
The update to GLib 2.75.3 broke nbd ("Log limit exceeded").

NixOS@8e5ee71

Upstream issue:

NetworkBlockDevice/nbd#146

This commit also applies a second patch from
the upstream development repository so that
the patch that fixes the issue applies cleanly.
@yoe
Copy link
Member

yoe commented May 1, 2023

Any chance we have a new release?

Yes, it's high on my todo list now.

However, before that can happen, I have a few loose ends to finish up; the structured replies patch broke the data logging in the transaction log, and on top of that nbd-trdump also needs to be taught about structured replies. I also don't feel confident enough that the patch which I applied will work sufficiently well; I'd like to do some more in-depth testing.

I'm hoping to be able to push that out sometime next week. Of course, if you feel up to it, help with patches or extra testing is always welcome 😉

@yoe
Copy link
Member

yoe commented May 1, 2023

I'm hoping to be able to push that out sometime next week. Of course, if you feel up to it, help with patches or extra testing is always welcome wink

Scratch that, I found time today. Release should be upcoming soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants