Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GFX connection reliably crashes with SIGABRT #2968

Closed
metalefty opened this issue Feb 26, 2024 · 22 comments · Fixed by #2969
Closed

GFX connection reliably crashes with SIGABRT #2968

metalefty opened this issue Feb 26, 2024 · 22 comments · Fixed by #2969

Comments

@metalefty
Copy link
Member

metalefty commented Feb 26, 2024

This issue is likely to be identical to the third issue I reported in #2942 (comment).

Testing the latest v0.10 branch c3cb855 on AlmaLinux 9.3.

Here's a simple program to reproduce: https://gist.github.com/metalefty/99b688ec67e5debb4d662fb6a12fc069

  1. Connect to xrdp session from 4k monitor (I tested 3840x2160)
  2. Start terminal emulator and maximize the window
  3. Run the program to reproduce
    • ./a.out 200 300
    • adjust cols arguments to fit your window, and set rows to some big value for example 200.
  4. xrdp process crashes with SIGABRT when the output of the program drawn
[2024-02-26T04:01:33.731+0000] [ERROR] Child 194657 terminated unexpectedly with signal SIGABRT

When trying to reconnect, xrdp process crashes SIGABRT as soon as reconnected because the same screen will be drawn again. If resize happens when reconnecting (reconnect from a different size), it gets back working. As far as I tested, this issue occurs when the session size is 4k. I can't reproduce this issue with 2560x1600 geometry. It is a smaller size than 3840x2160. It doesn't crash when the cols are less than 1/2 of the total screen width.

paint

On FreeBSD, xrdp process doesn't crash but I get a black blank screen instead. As same as AlmaLinux, it gets back to working when I make resizing happen (moving mstsc to other monitors). It is something related to the GFX encoder.

I also tried v0.10 + cherry-pick #2962 but it doesn't affect this issue.

@Nexarian I remember you have a 4k monitor. Can you confirm if you can reproduce this issue?

@Nexarian
Copy link
Contributor

Yup, I will be able to take a look tomorrow night (~7 PM EST on 2/27 and onwards)

My first guess is this could be some sort of memory limit that we're hitting.

@matt335672
Copy link
Member

I can reproduce this with a Windows VM with a 4K screen size (even though it doesn't fit on my main monitor!)

Stack trace is:-

#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140029321954880) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140029321954880) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140029321954880, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007f5b1ec42476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007f5b1ec287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007f5b1ec89676 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f5b1eddbb77 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#6  0x00007f5b1eca0cfc in malloc_printerr (str=str@entry=0x7f5b1edd97ec "malloc(): corrupted top size") at ./malloc/malloc.c:5664
#7  0x00007f5b1eca46f2 in _int_malloc (av=av@entry=0x7f5b18000030, bytes=bytes@entry=3153864) at ./malloc/malloc.c:4373
#8  0x00007f5b1eca5139 in __GI___libc_malloc (bytes=bytes@entry=3153864) at ./malloc/malloc.c:3329
#9  0x000055726193135e in xrdp_egfx_wire_to_surface2 (bulk=bulk@entry=0x0, surface_id=surface_id@entry=0, codec_id=codec_id@entry=9, codec_context_id=codec_context_id@entry=0, pixel_format=pixel_format@entry=32, 
    bitmap_data=bitmap_data@entry=0x7f5b1b302e70, bitmap_data_length=3145437) at xrdp_egfx.c:593
#10 0x00005572619136e4 in gfx_wiretosurface2 (self=self@entry=0x557265dc5bb0, bulk=bulk@entry=0x0, in_s=in_s@entry=0x7f5b1dfe39e0, enc_gfx_cmd=enc_gfx_cmd@entry=0x557265d4d960) at xrdp_encoder.c:696
#11 0x0000557261914595 in process_enc_egfx (self=0x557265dc5bb0, enc=0x557265d4d940) at xrdp_encoder.c:946
#12 0x0000557261914b8f in proc_enc_msg (arg=0x557265dc5bb0) at xrdp_encoder.c:1090
#13 0x00007f5b1ec94ac3 in start_thread (arg=<optimised out>) at ./nptl/pthread_create.c:442
#14 0x00007f5b1ed26850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

I'll try running with valgrind and see if I can pick up the initial memory violation

@matt335672
Copy link
Member

I can see what's happening and I have a fix, but it will need @Nexarian and/or @jsorg71 to work out why it works. I think we need to do this.

valgrind trace is :-

==4019== Thread 2:
==4019== Invalid write of size 1
==4019==    at 0x14915F: rfx_encode_diff_rlgr1 (rfxencode_diff_rlgr1.c:239)
==4019==    by 0x144613: rfx_pro_compose_message_region (rfxencode_compose.c:884)
==4019==    by 0x144A44: rfx_pro_compose_message_data (rfxencode_compose.c:952)
==4019==    by 0x141092: rfxcodec_encode_ex (rfxencode.c:403)
==4019==    by 0x1411C3: rfxcodec_encode (rfxencode.c:439)
==4019==    by 0x119684: gfx_wiretosurface2 (xrdp_encoder.c:677)
==4019==    by 0x11A594: process_enc_egfx (xrdp_encoder.c:946)
==4019==    by 0x11AB8E: proc_enc_msg (xrdp_encoder.c:1090)
==4019==    by 0x4ABFAC2: start_thread (pthread_create.c:442)
==4019==    by 0x4B50A03: clone (clone.S:100)
==4019==  Address 0xf52f040 is 0 bytes after a block of size 3,145,728 alloc'd
==4019==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==4019==    by 0x1195D3: gfx_wiretosurface2 (xrdp_encoder.c:669)
==4019==    by 0x11A594: process_enc_egfx (xrdp_encoder.c:946)
==4019==    by 0x11AB8E: proc_enc_msg (xrdp_encoder.c:1090)
==4019==    by 0x4ABFAC2: start_thread (pthread_create.c:442)
==4019==    by 0x4B50A03: clone (clone.S:100)
==4019== 
==4019== Invalid write of size 1
==4019==    at 0x1490C5: rfx_encode_diff_rlgr1 (rfxencode_diff_rlgr1.c:238)
==4019==    by 0x144613: rfx_pro_compose_message_region (rfxencode_compose.c:884)
==4019==    by 0x144A44: rfx_pro_compose_message_data (rfxencode_compose.c:952)
==4019==    by 0x141092: rfxcodec_encode_ex (rfxencode.c:403)
==4019==    by 0x1411C3: rfxcodec_encode (rfxencode.c:439)
==4019==    by 0x119684: gfx_wiretosurface2 (xrdp_encoder.c:677)
==4019==    by 0x11A594: process_enc_egfx (xrdp_encoder.c:946)
==4019==    by 0x11AB8E: proc_enc_msg (xrdp_encoder.c:1090)
==4019==    by 0x4ABFAC2: start_thread (pthread_create.c:442)
==4019==    by 0x4B50A03: clone (clone.S:100)
==4019==  Address 0xf52f041 is 1 bytes after a block of size 3,145,728 alloc'd
==4019==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==4019==    by 0x1195D3: gfx_wiretosurface2 (xrdp_encoder.c:669)
==4019==    by 0x11A594: process_enc_egfx (xrdp_encoder.c:946)
==4019==    by 0x11AB8E: proc_enc_msg (xrdp_encoder.c:1090)
==4019==    by 0x4ABFAC2: start_thread (pthread_create.c:442)
==4019==    by 0x4B50A03: clone (clone.S:100)
==4019== 
==4019== Thread 1:
==4019== Syscall param socketcall.sendto(msg) points to uninitialised byte(s)
==4019==    at 0x4B528FE: __libc_send (send.c:28)
==4019==    by 0x4B528FE: send (send.c:23)
==4019==    by 0x486D854: g_sck_send (os_calls.c:1406)
==4019==    by 0x4876172: trans_tcp_send (trans.c:84)
==4019==    by 0x48770E4: trans_write_copy_s (trans.c:624)
==4019==    by 0x48C4158: xrdp_iso_send (xrdp_iso.c:637)
==4019==    by 0x48CA0F6: xrdp_mcs_send (xrdp_mcs.c:1473)
==4019==    by 0x48E786D: xrdp_sec_send (xrdp_sec.c:1823)
==4019==    by 0x48BDE14: xrdp_channel_send (xrdp_channel.c:150)
==4019==    by 0x48C1099: xrdp_channel_drdynvc_data_first (xrdp_channel.c:1002)
==4019==    by 0x489920F: libxrdp_drdynvc_data_first (libxrdp.c:1458)
==4019==    by 0x1345DE: xrdp_egfx_send_data (xrdp_egfx.c:64)
==4019==    by 0x12CC75: xrdp_mm_process_enc_done (xrdp_mm.c:3469)
==4019==  Address 0x9848443 is 19 bytes inside a block of size 32,768 alloc'd
==4019==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==4019==    by 0x48C3E85: xrdp_iso_init (xrdp_iso.c:603)
==4019==    by 0x48C9CAE: xrdp_mcs_init (xrdp_mcs.c:1380)
==4019==    by 0x48E31F5: xrdp_sec_init (xrdp_sec.c:647)
==4019==    by 0x48BDBED: xrdp_channel_init (xrdp_channel.c:90)
==4019==    by 0x48C0EC0: xrdp_channel_drdynvc_data_first (xrdp_channel.c:980)
==4019==    by 0x489920F: libxrdp_drdynvc_data_first (libxrdp.c:1458)
==4019==    by 0x1345DE: xrdp_egfx_send_data (xrdp_egfx.c:64)
==4019==    by 0x12CC75: xrdp_mm_process_enc_done (xrdp_mm.c:3469)
==4019==    by 0x12D43D: xrdp_mm_check_wait_objs (xrdp_mm.c:3717)
==4019==    by 0x140079: xrdp_wm_check_wait_objs (xrdp_wm.c:2351)
==4019==    by 0x133F93: xrdp_process_main_loop (xrdp_process.c:287)
==4019== 

The memory buffer which isn't big enough is allocated at xrdp_encoder.c:669:-

xrdp/xrdp/xrdp_encoder.c

Lines 659 to 670 in c3cb855

if (ENC_IS_BIT_SET(flags, 0))
{
/* already compressed */
bitmap_data_length = enc_gfx_cmd->data_bytes;
bitmap_data = enc_gfx_cmd->data;
do_send = 1;
}
else
{
bitmap_data_length = self->max_compressed_bytes;
bitmap_data = g_new(char, bitmap_data_length);
if (bitmap_data == NULL)

and self->max_uncompressed_bytes is initialised here:-

xrdp/xrdp/xrdp_encoder.c

Lines 190 to 195 in c3cb855

if (client_info->gfx)
{
// Magic numbers... Why?
self->frames_in_flight = 2;
self->max_compressed_bytes = 3145728;
}

I've done some playing around, and if I set max_uncompressed_bytes to 6MB (i.e. 6 * 1024 * 1024), all is well. I still get errors however if I only set it to 4MB (i.e. 4 * 1024 * 1024)

I'm unable to comment further as I don't understand the 'magic number' comment.

@metalefty
Copy link
Member Author

Thanks @matt335672. It makes sense it happens only 4k geometry with a big screen update.

@jsorg71
Copy link
Contributor

jsorg71 commented Feb 26, 2024

It looks like I added it in sha1 966a819 in the origin egfx branch in 09/20/2020. I don't know where I got that and can't find any limit in MS-RDPEGFX. librfxcodec should not crash though, rfx_encode_diff_rlgr1 should be checking the out buffer. I'll check on that.
RemoteFX codec has both these settings but I don't know if GFX does.
It's probably OK to increase it.

@Nexarian
Copy link
Contributor

I called it a magic number precisely because I didn't know why it was there, and was largely copy/pasting! Thanks Jay for the follow-up. I'm pretty sure increasing it is fine as well.

@metalefty
Copy link
Member Author

Let's increase it to enough size for 4K dual monitor (maybe 12 * 1024 * 1024?).

@jsorg71
Copy link
Contributor

jsorg71 commented Feb 27, 2024

@metalefty I think it only needs to be as big as the biggest monitor. The shared memory and the encoder work on a monitor at a time. If there is no monitors and a huge desktop, it's like you say.

@matt335672
Copy link
Member

However, if there are no monitors and a huge desktop (e.g. if using VNC) we don't use the encoder. So I think this problem only occurs for GFX.

@matt335672
Copy link
Member

Sorry - I mean GFX with XUP.

@metalefty
Copy link
Member Author

Jay, thanks for the info.

What about assuming a 16k x 16k per monitor as a maximum size for the moment then? It is the maximum size that xrandr (xorgxrdp?) reports.

$ xrandr
Screen 0: minimum 256 x 256, current 1920 x 1080, maximum 16384 x 16384
rdp0 connected primary 1920x1080+0+0 0mm x 0mm
   1920x1080     50.00* 

@matt335672
Copy link
Member

16k x 16k gives us a value of 256Mb. That's assuming that @metalefty's test program is producing worst case sizes, and it may not be.

We can also set the size based on the largest monitor:-

--- a/xrdp/xrdp_encoder.c
+++ b/xrdp/xrdp_encoder.c
@@ -85,6 +85,40 @@ xrdp_enc_data_done_destructor(void *item, void *closure)
     g_free(enc_done);
 }
 
+/*****************************************************************************/
+static unsigned int
+get_largest_monitor_pixels(struct xrdp_mm *mm)
+{
+    int max_pixels;
+
+    struct xrdp_client_info *client_info = mm->wm->client_info;
+    struct display_size_description *display_sizes;
+    display_sizes = &client_info->display_sizes;
+
+    if (display_sizes->monitorCount < 1)
+    {
+        max_pixels = display_sizes->session_width *
+                     display_sizes->session_height;
+    }
+    else
+    {
+        max_pixels = 0;
+        struct monitor_info *minfo = display_sizes->minfo;
+        int i;
+        for (i = 0 ; i < display_sizes->monitorCount; ++i)
+        {
+            unsigned int pixels = (minfo[i].right + 1) - minfo[i].left;
+            pixels *= (minfo[i].bottom + 1) - minfo[i].top;
+            if (pixels > max_pixels)
+            {
+                max_pixels = pixels;
+            }
+        }
+    }
+
+    return max_pixels;
+}
+
 /*****************************************************************************/
 struct xrdp_encoder *
 xrdp_encoder_create(struct xrdp_mm *mm)
@@ -189,9 +223,12 @@ xrdp_encoder_create(struct xrdp_mm *mm)
     self->xrdp_encoder_term = g_create_wait_obj(buf);
     if (client_info->gfx)
     {
-        // Magic numbers... Why?
+        // Assume compressor needs to cope with largest monitor with
+        // ineffective compression
         self->frames_in_flight = 2;
-        self->max_compressed_bytes = 3145728;
+        self->max_compressed_bytes = get_largest_monitor_pixels(mm) * 4;
+        LOG_DEVEL(LOG_LEVEL_INFO, "Using %d max_compressed_bytes for encoder",
+                  self->max_compressed_bytes);
     }
     else
     {

That gives us 32Mb for a 4K monitor. Also, we're allocating max_compressed_bytes for each encoder in multimon which is somewhat wasteful.

What so you think @jsorg71 ?

@jsorg71
Copy link
Contributor

jsorg71 commented Feb 27, 2024

Ah, that looks best. Didn't test but that's the idea.

@metalefty
Copy link
Member Author

Wow, 256M per monitor per user is huge and wasteful. That looks best to me, too.

@matt335672
Copy link
Member

I'll put a PR together in the next hour or so.

Jay - can you look at whether we need to test the buffer while we're writing into it?

@jsorg71
Copy link
Contributor

jsorg71 commented Feb 27, 2024

@matt335672 yes, sure I am already. In fact I'm confused on why it crashes. I'm planning to set max_compressed_bytes really low and debug what's happening. I'll create another issue for this so we can close this one.

@matt335672
Copy link
Member

Reopened as it needs more investigation.

@jsorg71
Copy link
Contributor

jsorg71 commented Mar 14, 2024

This is taking longer than expected to fix. The way RFX works is that 'tiles_written' is returned and if that is not the whole tile set, it's called again.
This does not work with GFX because we can't have missing tiles in the tile set. Remember the green tiles in #2875 . This was caused by the missing tiles.
So I have to fix this a different way. Still making multiple calls to librfxcodec but using a region to control the tile set.

@jsorg71
Copy link
Contributor

jsorg71 commented Mar 18, 2024

To my surprise, what I said last is not true. It looks like as long as all the tiles are send before the end frame we're OK. I wish there was documentation on how the works.
I have working librfxcodec and xrdp changes to fix this without the huge buffer.
I'll prepare a PR for each.

@jsorg71
Copy link
Contributor

jsorg71 commented Mar 23, 2024

should be fixed with #3013 and neutrinolabs/librfxcodec#61

@matt335672
Copy link
Member

That's great news @jsorg71.

I'll test this on my own setup tomorrow (UK time).

@metalefty
Copy link
Member Author

This one can be closed now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants