Syncthing startup hang #7392
Comments
Can you get a backtrace from the hang? Does the hang appear to be at the CPUID instruction itself (as if the instruction never finishes)? Can the hung process be killed? Does a simple program that calls CPUID manually work? CPUID is intercepted by Xen and emulated in the hypervisor, so this could be a Xen bug.
Here is the full backtrace and the code around the hang, obtained with delve (the Go debugger):
It appears to hang at line 21: the CPUID instruction is executed (the line counter advances past it) but never finishes, causing the hang. At least, that is how it looks. Yes, the hung process can be killed. However, a simple assembly program calling CPUID does not hang, which is pretty strange.
That seems to indicate that whatever is going on is not the process being stuck in Xen, since that wouldn’t be killable.
Can you try calling CPUID manually with the exact same parameters as the failing call? This seems like a strange low-level bug in either Xen or Linux. Would you by any chance be able to see what happens in kernel mode when the CPUID instruction is issued?
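For reference, a minimal sketch of such a manual test, assuming GCC/Clang's `<cpuid.h>` helpers (the leaf and subleaf below are placeholders to be replaced with the parameters of the failing call):

```c
/* Minimal manual CPUID test: issue the instruction with an explicit
 * leaf/subleaf and print the raw register values.
 * Build with: gcc -O0 -o cpuidtest cpuidtest.c
 */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    /* Placeholder values: substitute the parameters of the failing call. */
    unsigned int leaf = 0x8000001d, subleaf = 0;
    unsigned int eax, ebx, ecx, edx;

    /* __cpuid_count issues CPUID with EAX=leaf, ECX=subleaf. */
    __cpuid_count(leaf, subleaf, eax, ebx, ecx, edx);
    printf("leaf %#x subleaf %u -> eax=%08x ebx=%08x ecx=%08x edx=%08x\n",
           leaf, subleaf, eax, ebx, ecx, edx);
    return 0;
}
```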
Further investigation has revealed a few things (not sure how I didn't pick up on some of this earlier):

- At this loop, the loop never ends, creating the illusion that CPUID itself never finishes (for some reason, whenever I Ctrl-C'd it, it always stopped around CPUID, probably because those instructions are the most intensive). See the sketch below.
- On my AMD Ryzen 1600AF (basically a 2600), the cpuidex call (extended CPUID, which takes an extra subleaf argument) returns the exact same values no matter the value of the second argument (which is what the loop increments). I think this is incorrect, but I cannot find data to back that up either way.
- On my Intel i5-7200U laptop, after complete system upgrades, I cannot reproduce this issue. It seems to be limited to AMD systems (I don't have more AMD hardware to test on).
- On my second Qubes install for testing things (on a separate drive), I can reproduce this problem, but NOT if I run the syncthing binary in dom0 (casual readers: please do not do this; the security implications are possible complete system compromise).
- When I next have some spare time to investigate, I will check whether the same CPUID call produces the same output in dom0 and in a qube. (Hypothesis: it will not.)
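A sketch (not the actual syncthing/cpuid source) of the kind of loop described above, assuming the standard AMD convention: subleaves of leaf 0x8000001d are walked until EAX[4:0] (the cache type field) reads zero, meaning "no more cache levels". If the hypervisor returns the same non-zero leaf for every subleaf, that condition never fires:

```c
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* Safety cap added here for illustration; the buggy loops in the
     * wild have none, which is exactly why they hang when every
     * subleaf returns identical non-zero values. */
    for (unsigned int sub = 0; sub < 64; sub++) {
        __cpuid_count(0x8000001d, sub, eax, ebx, ecx, edx);
        if ((eax & 0x1f) == 0)  /* cache type 0: end of the list */
            break;
        printf("subleaf %u: eax=%08x ebx=%08x ecx=%08x edx=%08x\n",
               sub, eax, ebx, ecx, edx);
    }
    return 0;
}
```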
Can you say specifically what values are returned there?
This is a bug in Xen with how we derive CPUID data for guests. Leaf 0x8000001d shouldn't be exposed because it's not yet tied into the topology logic (the number of CPUs sharing this cache needs adjusting based on the number of vCPUs the VM has.) |
Ah, makes sense, but how come this worked in the past then? |
It will be a consequence of https://github.com/QubesOS/qubes-vmm-xen/blob/xen-4.14/patch-stable-0002-x86-cpuid-support-LFENCE-always-serialising-CPUID-bi.patch being backported for speculation reasons. Before that patch, 0x8000001c was reported as the max extended leaf, so userspace wouldn't have gone looking at 0x8000001d. |
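For context, a sketch of the standard discovery convention at play here (assuming GCC/Clang's `<cpuid.h>`): leaf 0x80000000 returns the maximum supported extended leaf in EAX, which is why raising that maximum from 0x8000001c newly exposed 0x8000001d to probing:

```c
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int max_ext, ebx, ecx, edx;

    /* CPUID with EAX=0x80000000 reports the highest extended leaf. */
    __cpuid(0x80000000, max_ext, ebx, ecx, edx);
    printf("max extended leaf: %#x\n", max_ext);

    if (max_ext >= 0x8000001d)
        puts("0x8000001d is in range, so userspace may try to enumerate it");
    return 0;
}
```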
Actually, there is also a bug in whatever piece of userspace is looping here. The AMD manual states that the contents of 0x8000001d and 0x8000001e are only valid if the TOPOEXT feature (CPUID.80000001.ecx[22]) is visible. Xen hides this feature, so userspace shouldn't go looking (see the sketch below).
Time to be mean and inject #GP into the buggy guest userspace? (Only half-joking.) |
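A sketch of the guard the AMD manual calls for, assuming the usual encoding (TOPOEXT is bit 22 of ECX in leaf 0x80000001) and GCC/Clang's `<cpuid.h>`:

```c
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 if the requested leaf is out of range. */
    if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx)) {
        puts("extended leaf 0x80000001 not available");
        return 0;
    }
    if (!(ecx & (1u << 22))) {  /* TOPOEXT not advertised */
        puts("TOPOEXT hidden: skip leaves 0x8000001d/0x8000001e");
        return 0;
    }
    puts("TOPOEXT present: cache topology leaves are valid to enumerate");
    return 0;
}
```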
@fosslinux Any chance I can have a name & email address for a Reported-by tag on the upstream fix? |
I don't use my real name (at least at this time). So if "fosslinux <fosslinux@aussies.space>" works for the tag, use that.
c/s 1a91425 increased the AMD max leaf from 0x8000001c to 0x80000021, but did not adjust anything in the calculate_*_policy() chain. As a result, on hardware supporting these leaves, we read the real hardware values into the raw policy, then copy into host, and all the way into the PV/HVM default policies.

All 4 of these leaves have enable bits (first two by TopoExt, next by SEV, next by PQOS), so any software following the rules is fine and will leave them alone. However, leaf 0x8000001d takes a subleaf input and at least two userspace utilities have been observed to loop indefinitely under Xen (clearly waiting for eax to report "no more cache levels").

Such userspace is buggy, but Xen's behaviour isn't great either. In the short term, clobber all information in these leaves. This is a giant bodge, but there are complexities with implementing all of these leaves properly.

Fixes: 1a91425 ("x86/cpuid: support LFENCE always serialising CPUID bit")
Link: QubesOS/qubes-issues#7392
Reported-by: fosslinux <fosslinux@aussies.space>
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Is there any current workaround for this issue?
@marmarek time to pop out a new Xen package? |
c/s 1a91425 increased the AMD max leaf from 0x8000001c to 0x80000021, but did not adjust anything in the calculate_*_policy() chain. As a result, on hardware supporting these leaves, we read the real hardware values into the raw policy, then copy into host, and all the way into the PV/HVM default policies.

All 4 of these leaves have enable bits (first two by TopoExt, next by SEV, next by PQOS), so any software following the rules is fine and will leave them alone. However, leaf 0x8000001d takes a subleaf input and at least two userspace utilities have been observed to loop indefinitely under Xen (clearly waiting for eax to report "no more cache levels").

Such userspace is buggy, but Xen's behaviour isn't great either. In the short term, clobber all information in these leaves. This is a giant bodge, but there are complexities with implementing all of these leaves properly.

Fixes: 1a91425 ("x86/cpuid: support LFENCE always serialising CPUID bit")
Link: QubesOS/qubes-issues#7392
Reported-by: fosslinux <fosslinux@aussies.space>
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d4012d5
master date: 2022-04-07 11:36:45 +0100
Qubes OS release
4.1
Brief summary
After my most recent set of updates, syncthing does not start in any of my VMs. This is not unique to any one distribution: I can reproduce it on Fedora, Debian, and even my un-updated Void Linux StandaloneVM. It previously worked fine, so this is a regression.
I still have a system that I have not updated, on which syncthing works perfectly. I will attempt to track down the culprit update.
Debugging with delve narrows the hang down to a CPUID call in https://github.com/klauspost/cpuid/blob/master/cpuid_amd64.s#L9.
I have tried:
I am yet to try:
Steps to reproduce
Apply the recent updates, then start syncthing in any qube.
Expected behavior
Syncthing starts.
Actual behavior
Syncthing does not start; it hangs during startup.