This repository has been archived by the owner on Apr 14, 2023. It is now read-only.

Freezes and timeouts #65

Closed
dom-lgtm opened this issue Jun 29, 2022 · 17 comments

Comments

@dom-lgtm
Contributor

Hey,

I have trouble restoring because of timeouts and freezes. Any advice?

I also notice that the VMs (Linux or iOS) often freeze for no apparent reason until I focus the GUI window.

@dom-lgtm
Contributor Author

My latest try:

restore-step-ids = {0x1103068B:124}
restore-step-names = {0x1103068B:seal_system_volume}
restore-step-uptime = 1287
restore-step-user-progress = 97
[07:41:05.0563-GMT]{3>6} CHECKPOINT BEGIN: RESTORED:[0x0648] cleanup_send_final_status
restore-step-ids = {0x1103068B:124;0x11030648:128}
restore-step-names = {0x1103068B:seal_system_volume;0x11030648:cleanup_send_final_status}
restore-step-uptime = 1287
restore-step-user-progress = 97

ERROR: Unable to successfully restore device
Checkpoint 1608 complete with code 0
FDR 0x55cc57d7d190 terminating...
ERROR: Unable to restore device

@dom-lgtm
Contributor Author

The VM then rebooted and panicked after some time:

panic(cpu 0 caller 0xfffffff02735ccc8): watchdog timeout: no checkins from watchdogd since boot (601 seconds ago)
Debugger message: panic
Memory ID: 0x0
OS release type: User
OS version: 19F77
Kernel version: Darwin Kernel Version 21.5.0: Thu Apr 21 21:51:27 PDT 2022; root:xnu-8020.122.1~1/RELEASE_ARM64_T8030
Kernel UUID: 8DA68E05-5BBB-3D3A-ACF5-7EA227ADDFD7
iBoot version: qemu-t8030
secure boot?: YES
Paniclog version: 13
Kernel slide:      0x000000001d908000
Kernel text base:  0xfffffff02490c000
mach_absolute_time: 0x3613ae4f0
Epoch Time:        sec       usec
  Boot    : 0x62bc031f 0x000ce543
  Sleep   : 0x00000000 0x00000000
  Wake    : 0x00000000 0x00000000
  Calendar: 0x62bc0579 0x00031619

Zone info:
  Zone map: 0xffffffe001eb8000 - 0xffffffe601eb8000
  . VM    : 0xffffffe001eb8000 - 0xffffffe0e851c000
  . RO    : 0xffffffe0e851c000 - 0xffffffe1351e8000
  . GEN0  : 0xffffffe1351e8000 - 0xffffffe21b84c000
  . GEN1  : 0xffffffe21b84c000 - 0xffffffe301eb0000
  . GEN2  : 0xffffffe301eb0000 - 0xffffffe3e8518000
  . GEN3  : 0xffffffe3e8518000 - 0xffffffe4ceb80000
  . DATA  : 0xffffffe4ceb80000 - 0xffffffe601eb8000
  Metadata: 0xffffffe72c498000 - 0xffffffe72dc98000
  Bitmaps : 0xffffffe72dc98000 - 0xffffffe72f1b4000

CORE 0 is the one that panicked. Check the full backtrace for details.
CORE 1: PC=0xfffffff0255b6328, LR=0xfffffff0255b6324, FP=0xffffffe73584be90
CORE 2: PC=0xfffffff0255b6328, LR=0xfffffff0255b6324, FP=0xffffffe7358ebe90
CORE 3: PC=0xfffffff0255b6328, LR=0xfffffff0255b6324, FP=0xffffffe73592be90
Compressor Info: 0% of compressed pages limit (OK) and 0% of segments limit (OK) with 0 swapfiles and OK swap space
Total cpu_usage: 48135040
Thread task pri cpu_usage
0xffffffe21b7c8dc0 kernel_task 0 418935
0xffffffe21b7c2940 kernel_task 0 0
0xffffffe21b7b94a0 kernel_task 0 5511560
0xffffffe21b7d94a0 kernel_task 0 0

Panicked task 0xffffffe21b7ec668: 0 pages, 111 threads: pid 0: kernel_task
Panicked thread: 0xffffffe21b7c8dc0, backtrace: 0xffffffe7352bf860, tid: 102
		  lr: 0xfffffff02547472c  fp: 0xffffffe7352bf8a0
		  lr: 0xfffffff025474444  fp: 0xffffffe7352bf910
		  lr: 0xfffffff0255c26c4  fp: 0xffffffe7352bf930
		  lr: 0xfffffff0255b34b0  fp: 0xffffffe7352bf9a0
		  lr: 0xfffffff0255b22dc  fp: 0xffffffe7352bfa60
		  lr: 0xfffffff025bb8610  fp: 0xffffffe7352bfa70
		  lr: 0xfffffff025474120  fp: 0xffffffe7352bfe00
		  lr: 0xfffffff025474120  fp: 0xffffffe7352bfe60
		  lr: 0xfffffff0274f801c  fp: 0xffffffe7352bfe80
		  lr: 0xfffffff02735ccc8  fp: 0xffffffe7352bfeb0
		  lr: 0xfffffff02735c250  fp: 0xffffffe7352bfef0
		  lr: 0xfffffff02735a670  fp: 0xffffffe7352bff10
		  lr: 0xfffffff026dfd490  fp: 0xffffffe7352bffb0
		  lr: 0xfffffff025b02b90  fp: 0xffffffe7352bffc0
		  lr: 0xfffffff0255b3f0c  fp: 0xffffffe7352bffe0
		  lr: 0xfffffff025bb8684  fp: 0xffffffe7352bfff0
		  lr: 0xfffffff0255b6324  fp: 0xffffffe7358abe90
		  lr: 0xfffffff0254a7fb8  fp: 0xffffffe7358abf00
		  lr: 0xfffffff0254a8234  fp: 0xffffffe7358abf20
		  lr: 0xfffffff025bc0920  fp: 0x0000000000000000


** Stackshot Succeeded ** Bytes Traced 49210 (Uncompressed 133568) **
IOPlatformPanicAction -> AppleANS2NVMeController
IOPlatformPanicAction -> AppleT8030PMGR
IOPlatformPanicAction -> AppleARMWatchdogTimer
IOPlatformPanicAction -> AppleNubSynopsysOTG3Device
IOPlatformPanicAction -> RTBuddyV2
IOPlatformPanicAction -> RTBuddyV2
IOPlatformPanicAction -> RTBuddyV2
IOPlatformPanicAction -> AppleSMC
IOPlatformPanicAction -> AppleANS2NVMeController
IOPlatformPanicAction -> AppleT8030PMGR
IOPlatformPanicAction -> AppleARMWatchdogTimer
IOPlatformPanicAction -> AppleNubSynopsysOTG3Device
IOPlatformPanicAction -> RTBuddyV2
IOPlatformPanicAction -> RTBuddyV2
IOPlatformPanicAction -> RTBuddyV2
IOPlatformPanicAction -> AppleSMC

Please go to https://panic.apple.com to report this panic

@TrungNguyen1909
Owner

For the second log, you can add wdt=-1 to the quoted boot-args string after -append as a workaround.

The first one is expected for newer iOS.

I don't have a solution yet for the freezes when the GUI window is not focused; macOS seems to throttle unfocused windows.
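To make the workaround concrete, here is a sketch of where wdt=-1 goes. Everything except -append is a placeholder standing in for your existing qemu-t8030 command line; only the boot-args addition is the point:

```shell
# Sketch only: all flags other than -append come from your existing
# qemu-t8030 invocation. The change is appending wdt=-1 (which tells
# the kernel watchdog to stand down) inside the quoted boot-args.
qemu-system-aarch64 \
    -M t8030 \
    ... \
    -append "your-existing-boot-args wdt=-1"
```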

@dom-lgtm
Contributor Author

Thanks, I will play a bit more.

@dom-lgtm
Contributor Author

I had the same throttling issue on a project a few years back but I never managed to fix it :/

@dom-lgtm
Contributor Author

dom-lgtm commented Jul 9, 2022

I fixed the freezes and timeouts with: https://github.com/TrungNguyen1909/qemu-t8030/pull/68/files

@TrungNguyen1909
Owner

@dom-lgtm Hello, I get that removing the mutex usage could possibly work, but your PR looks very sketchy to me.

Can you first add --enable-debug-info to the configure command and rebuild the whole thing (make)?

Then reproduce the hangs, use gdb/lldb to attach to both the Linux VM and the iOS VM while they are hung, run thread apply all bt (gdb) or bt all (lldb), and send me the FULL output of both (just what appears in the debugger)?
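One non-interactive way to capture what is being asked for (assuming a single running qemu-system-aarch64 process; adjust the pgrep pattern to match your setup):

```shell
# Attach lldb to the newest qemu-system-aarch64 process, dump every
# thread's backtrace, then detach without disturbing the VM.
lldb -p "$(pgrep -n qemu-system-aarch64)" \
     -o "bt all" \
     -o "detach" \
     -o "quit"
```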

Thanks.

@dom-lgtm
Contributor Author

dom-lgtm commented Jul 9, 2022

With all due respect, it's your use of mutexes I find sketchy. Are you sure you understand mutexes?

  1. Can you explain why you need to lock the iothread mutex in this code?
  2. Why does a thread basically lock the iothread mutex forever? (qemu_mutex_lock_iothread();)
  3. Why do you need unsafe unlocks everywhere to fix 2 (e.g. qemu_mutex_unlock_iothread();)?

/cc @Siguza

@TrungNguyen1909
Owner

Hello @dom-lgtm,

All the mutex usage in hw/usb/hcd-tcp.c is based on the code here.

  1. This thread doesn't lock the mutex forever. If you follow the calls, you will see that it is eventually unlocked in usb_tcp_remote_read_one() -> usb_tcp_remote_read(): it unlocks before the read and relocks after it (because read blocks). The reason for that design is to make sure that all the code in between (the processing) is synchronized.

Yes, you could say that all the other queue locks are useless (they are there for historical reasons). But I want all the processing code to run under the iothread lock, because EHCI runs inside the IOThread (implicitly, through qemu_bh).

And with all due respect, please send me the backtrace I requested so that we can figure this out.

Thanks.

@dom-lgtm
Contributor Author

dom-lgtm commented Jul 9, 2022

@TrungNguyen1909 Do I understand correctly that, if there is data available to read on the socket, the iothread is basically never unlocked, except for the few instructions during which another thread might, with a lot of luck, manage to grab the lock for its own use? That could explain the lockups, and why my fix removes them.

@TrungNguyen1909
Owner

TrungNguyen1909 commented Jul 9, 2022

@dom-lgtm,
Without your backtrace, I am unable to explain the lockups for you.

Well yes, assuming read constantly has data coming in, the IOThread is mostly locked.

Hmm, however, I figure that maybe we should remove all the other locks except the IOThread one and make sure no deadlocks are possible; as long as every bit of processing runs under the IOThread lock, we can still be thread-safe.

@dom-lgtm
Contributor Author

dom-lgtm commented Jul 12, 2022

Sorry, I was out for the weekend. Here is a bt from the Linux VM (both Linux and iOS hang).

(lldb) bt all
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007ff8097c797a libsystem_kernel.dylib`mach_msg_trap + 10
    frame #1: 0x00007ff8097c7ce8 libsystem_kernel.dylib`mach_msg + 56
    frame #2: 0x00007ff8098cb38d CoreFoundation`__CFRunLoopServiceMachPort + 319
    frame #3: 0x00007ff8098c9a18 CoreFoundation`__CFRunLoopRun + 1276
    frame #4: 0x00007ff8098c8e5c CoreFoundation`CFRunLoopRunSpecific + 562
    frame #5: 0x00007ff8125705e6 HIToolbox`RunCurrentEventLoopInMode + 292
    frame #6: 0x00007ff81257034a HIToolbox`ReceiveNextEventCommon + 594
    frame #7: 0x00007ff8125700e5 HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter + 70
    frame #8: 0x00007ff80c3081fd AppKit`_DPSNextEvent + 927
    frame #9: 0x00007ff80c3068ba AppKit`-[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 1394
    frame #10: 0x00007ff80c2f8f69 AppKit`-[NSApplication run] + 586
    frame #11: 0x000000010bf595cd qemu-system-aarch64`main(argc=<unavailable>, argv=<unavailable>) at cocoa.m:1989:5 [opt]
    frame #12: 0x0000000117d6751e dyld`start + 462
  thread #2
    frame #0: 0x00007ff8097ca3ea libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007ff809804a6f libsystem_pthread.dylib`_pthread_cond_wait + 1249
    frame #2: 0x000000010c7905a7 qemu-system-aarch64`qemu_event_wait [inlined] qemu_futex_wait(ev=0x000000010ccf8738, val=4294967295) at qemu-thread-posix.c:386:9 [opt]
    frame #3: 0x000000010c790587 qemu-system-aarch64`qemu_event_wait(ev=0x000000010ccf8738) at qemu-thread-posix.c:481:9 [opt]
    frame #4: 0x000000010c7999c0 qemu-system-aarch64`call_rcu_thread(opaque=0x0000000000000000) at rcu.c:261:21 [opt]
    frame #5: 0x000000010c79075e qemu-system-aarch64`qemu_thread_start(args=<unavailable>) at qemu-thread-posix.c:556:9 [opt]
    frame #6: 0x00007ff8098044e1 libsystem_pthread.dylib`_pthread_start + 125
    frame #7: 0x00007ff8097fff6b libsystem_pthread.dylib`thread_start + 15
  thread #3
    frame #0: 0x00007ff8097cfd5a libsystem_kernel.dylib`__select + 10
    frame #1: 0x000000010db3ae0f libglib-2.0.0.dylib`g_poll + 505
    frame #2: 0x000000010c7ac017 qemu-system-aarch64`main_loop_wait [inlined] os_host_main_loop_wait(timeout=61000) at main-loop.c:250:11 [opt]
    frame #3: 0x000000010c7abfd8 qemu-system-aarch64`main_loop_wait(nonblocking=<unavailable>) at main-loop.c:531:11 [opt]
    frame #4: 0x000000010bf96157 qemu-system-aarch64`qemu_main_loop at runstate.c:727:9 [opt]
    frame #5: 0x000000010bf28ede qemu-system-aarch64`qemu_main(argc=50, argv=0x00007ff7b3fdea20, envp=<unavailable>) at main.c:50:5 [opt]
    frame #6: 0x000000010bf59636 qemu-system-aarch64`call_qemu_main(opaque=0x0000000000000000) at cocoa.m:1939:14 [opt]
    frame #7: 0x000000010c79075e qemu-system-aarch64`qemu_thread_start(args=<unavailable>) at qemu-thread-posix.c:556:9 [opt]
    frame #8: 0x00007ff8098044e1 libsystem_pthread.dylib`_pthread_start + 125
    frame #9: 0x00007ff8097fff6b libsystem_pthread.dylib`thread_start + 15
  thread #4
    frame #0: 0x00007ff8097d0512 libsystem_kernel.dylib`__sigwait + 10
    frame #1: 0x00007ff809804c25 libsystem_pthread.dylib`sigwait + 29
    frame #2: 0x000000010c78dde9 qemu-system-aarch64`sigwait_compat(opaque=0x00006000010fc980) at compatfd.c:36:15 [opt]
    frame #3: 0x000000010c79075e qemu-system-aarch64`qemu_thread_start(args=<unavailable>) at qemu-thread-posix.c:556:9 [opt]
    frame #4: 0x00007ff8098044e1 libsystem_pthread.dylib`_pthread_start + 125
    frame #5: 0x00007ff8097fff6b libsystem_pthread.dylib`thread_start + 15
  thread #5
    frame #0: 0x00007ff8097ca3ea libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007ff809804a6f libsystem_pthread.dylib`_pthread_cond_wait + 1249
    frame #2: 0x000000010c78fe4a qemu-system-aarch64`qemu_cond_wait_impl(cond=0x00006000007c8cc0, mutex=0x000000010cce2c00, file="../softmmu/cpus.c", line=424) at qemu-thread-posix.c:195:11 [opt]
    frame #3: 0x000000010bf8fbc8 qemu-system-aarch64`qemu_wait_io_event(cpu=0x00007f8f1f7c5000) at cpus.c:424:9 [opt]
    frame #4: 0x000000010c640221 qemu-system-aarch64`mttcg_cpu_thread_fn(arg=0x00007f8f1f7c5000) at tcg-accel-ops-mttcg.c:124:9 [opt]
    frame #5: 0x000000010c79075e qemu-system-aarch64`qemu_thread_start(args=<unavailable>) at qemu-thread-posix.c:556:9 [opt]
    frame #6: 0x00007ff8098044e1 libsystem_pthread.dylib`_pthread_start + 125
    frame #7: 0x00007ff8097fff6b libsystem_pthread.dylib`thread_start + 15
  thread #6
    frame #0: 0x00007ff8097ca3ea libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007ff809804a6f libsystem_pthread.dylib`_pthread_cond_wait + 1249
    frame #2: 0x000000010c78fe4a qemu-system-aarch64`qemu_cond_wait_impl(cond=0x00006000007c9680, mutex=0x000000010cce2c00, file="../softmmu/cpus.c", line=424) at qemu-thread-posix.c:195:11 [opt]
    frame #3: 0x000000010bf8fbc8 qemu-system-aarch64`qemu_wait_io_event(cpu=0x00007f8f2000b000) at cpus.c:424:9 [opt]
    frame #4: 0x000000010c640221 qemu-system-aarch64`mttcg_cpu_thread_fn(arg=0x00007f8f2000b000) at tcg-accel-ops-mttcg.c:124:9 [opt]
    frame #5: 0x000000010c79075e qemu-system-aarch64`qemu_thread_start(args=<unavailable>) at qemu-thread-posix.c:556:9 [opt]
    frame #6: 0x00007ff8098044e1 libsystem_pthread.dylib`_pthread_start + 125
    frame #7: 0x00007ff8097fff6b libsystem_pthread.dylib`thread_start + 15
  thread #7
    frame #0: 0x00007ff8097ca3ea libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007ff809804a6f libsystem_pthread.dylib`_pthread_cond_wait + 1249
    frame #2: 0x000000010c78fe4a qemu-system-aarch64`qemu_cond_wait_impl(cond=0x00006000007e9580, mutex=0x000000010cce2c00, file="../softmmu/cpus.c", line=424) at qemu-thread-posix.c:195:11 [opt]
    frame #3: 0x000000010bf8fbc8 qemu-system-aarch64`qemu_wait_io_event(cpu=0x00007f8f200be000) at cpus.c:424:9 [opt]
    frame #4: 0x000000010c640221 qemu-system-aarch64`mttcg_cpu_thread_fn(arg=0x00007f8f200be000) at tcg-accel-ops-mttcg.c:124:9 [opt]
    frame #5: 0x000000010c79075e qemu-system-aarch64`qemu_thread_start(args=<unavailable>) at qemu-thread-posix.c:556:9 [opt]
    frame #6: 0x00007ff8098044e1 libsystem_pthread.dylib`_pthread_start + 125
    frame #7: 0x00007ff8097fff6b libsystem_pthread.dylib`thread_start + 15
  thread #8
    frame #0: 0x00007ff8097ca3ea libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007ff809804a6f libsystem_pthread.dylib`_pthread_cond_wait + 1249
    frame #2: 0x000000010c78fe4a qemu-system-aarch64`qemu_cond_wait_impl(cond=0x00006000007ca0c0, mutex=0x000000010cce2c00, file="../softmmu/cpus.c", line=424) at qemu-thread-posix.c:195:11 [opt]
    frame #3: 0x000000010bf8fbc8 qemu-system-aarch64`qemu_wait_io_event(cpu=0x00007f8f20171000) at cpus.c:424:9 [opt]
    frame #4: 0x000000010c640221 qemu-system-aarch64`mttcg_cpu_thread_fn(arg=0x00007f8f20171000) at tcg-accel-ops-mttcg.c:124:9 [opt]
    frame #5: 0x000000010c79075e qemu-system-aarch64`qemu_thread_start(args=<unavailable>) at qemu-thread-posix.c:556:9 [opt]
    frame #6: 0x00007ff8098044e1 libsystem_pthread.dylib`_pthread_start + 125
    frame #7: 0x00007ff8097fff6b libsystem_pthread.dylib`thread_start + 15
  thread #9, name = 'com.apple.NSEventThread'
    frame #0: 0x00007ff8097c797a libsystem_kernel.dylib`mach_msg_trap + 10
    frame #1: 0x00007ff8097c7ce8 libsystem_kernel.dylib`mach_msg + 56
    frame #2: 0x00007ff8098cb38d CoreFoundation`__CFRunLoopServiceMachPort + 319
    frame #3: 0x00007ff8098c9a18 CoreFoundation`__CFRunLoopRun + 1276
    frame #4: 0x00007ff8098c8e5c CoreFoundation`CFRunLoopRunSpecific + 562
    frame #5: 0x00007ff80c475c1e AppKit`_NSEventThread + 132
    frame #6: 0x00007ff8098044e1 libsystem_pthread.dylib`_pthread_start + 125
    frame #7: 0x00007ff8097fff6b libsystem_pthread.dylib`thread_start + 15
  thread #10
    frame #0: 0x00007ff8097fff48 libsystem_pthread.dylib`start_wqthread
(lldb)

@dom-lgtm
Contributor Author

> Well yes, assuming read constantly has data coming in, the IOThread is mostly locked.

Could this, by chance, be the case while the firmware is being transmitted, which is coincidentally when the VMs most often lock up?

@TrungNguyen1909
Owner

I can't see anything strange here. Can you send me the bt for the iOS VM?

@dom-lgtm
Contributor Author

dom-lgtm commented Jul 12, 2022

I can't claim to know much about QEMU or multi-threaded code, but I can't help wondering whether spinning is the best choice here:

while ((qatomic_mb_read(&inflightPacket.handled) & 1) == 0) {

@TrungNguyen1909
Owner

TrungNguyen1909 commented Jul 12, 2022

@dom-lgtm,
You might be wondering, but there is a reason for that.

The QEMU USB stack does not accept USB_RET_NAK after a packet has returned USB_RET_ASYNC, and USB_RET_ASYNC is the only way to defer the completion of a USB packet.

So we have to wait for the other side to respond. It might not be the best way to work around this, but it's the only one we currently have.

@dom-lgtm
Contributor Author

This code is too complex for me. I'm going to work on my own fork so I don't mess with the mutexes.

@dom-lgtm closed this as not planned on Jul 14, 2022