
Syncthing startup hang #7392

Closed
fosslinux opened this issue Mar 30, 2022 · 21 comments · Fixed by QubesOS/qubes-vmm-xen#130
Labels
• affects-4.1 (This issue affects Qubes OS 4.1.)
• C: Xen
• diagnosed (Technical diagnosis has been performed; see issue comments.)
• hardware support
• P: default (Priority: default. Default priority for new issues, to be replaced given sufficient information.)
• r4.1-bookworm-stable, r4.1-bullseye-stable, r4.1-buster-stable, r4.1-centos-stream8-stable, r4.1-dom0-stable
• T: bug (Type: bug report. A problem or defect resulting in unintended behavior in something that exists.)

Comments

@fosslinux

fosslinux commented Mar 30, 2022

Qubes OS release

4.1

Brief summary

After my most recent set of updates, syncthing does not start in any of my VMs. This is not unique to any one distribution: I can reproduce it on Fedora, Debian, and even my not-yet-updated Void Linux StandaloneVM. It previously worked fine (regression).

I still have a system that I have not updated, on which syncthing works perfectly. I will attempt to track down the culprit update.

Debugging with delve seems to trace the hang to a CPUID call in https://github.com/klauspost/cpuid/blob/master/cpuid_amd64.s#L9.

I have tried:

  • changing the kernel version

I am yet to try:

  • downgrading any packages from dom0
  • just calling CPUID directly, to see whether CPUID itself is broken (see the sketch below)
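Something like the following would do for that last point (a rough, untested sketch using the __get_cpuid helper from GCC/Clang's cpuid.h):

/* Minimal CPUID smoke test (sketch, not yet run): issue CPUID for the
 * basic vendor leaf and print the raw registers. */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* Leaf 0: EAX holds the highest basic leaf, EBX/EDX/ECX the vendor string. */
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx)) {
        puts("CPUID not available");
        return 1;
    }
    printf("max basic leaf: %u\n", eax);
    printf("vendor: %.4s%.4s%.4s\n", (char *)&ebx, (char *)&edx, (char *)&ecx);
    return 0;
}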

Steps to reproduce

  1. Fully update a Qubes OS system.
  2. Install syncthing in any AppVM.
  3. Attempt to run syncthing and note the hang.

Expected behavior

Syncthing starts.

Actual behavior

Syncthing does not start.

@fosslinux added the "P: default" and "T: bug" labels on Mar 30, 2022
@andrewdavidwong added the "C: other" and "needs diagnosis" labels on Mar 30, 2022
@andrewdavidwong added this to the "Release 4.1 updates" milestone on Mar 30, 2022
@DemiMarie

Can you get a backtrace from the hang? Does the hang appear to be at the CPUID instruction itself (as if the instruction never finishes)? Can the hung process be killed? Does a simple program that calls CPUID manually work?

CPUID is intercepted by Xen and emulated in the hypervisor, so this could be a Xen bug.

@fosslinux (Author)

Here is the full backtrace and the code around the hang, using delve (the Go debugger):

$ dlv exec ./syncthing 
Type 'help' for list of commands.
(dlv) r
Process restarted with PID 16155
(dlv) c
received SIGINT, stopping process (will not forward signal)
> github.com/klauspost/cpuid/v2.asmCpuidex() /home/user/go/pkg/mod/github.com/klauspost/cpuid/v2@v2.0.9/cpuid_amd64.s:21 (PC: 0x85cfea)
    16:	// func asmCpuidex(op, op2 uint32) (eax, ebx, ecx, edx uint32)
    17:	TEXT ·asmCpuidex(SB), 7, $0
    18:		MOVL op+0(FP), AX
    19:		MOVL op2+4(FP), CX
    20:		CPUID
=>  21:		MOVL AX, eax+8(FP)
    22:		MOVL BX, ebx+12(FP)
    23:		MOVL CX, ecx+16(FP)
    24:		MOVL DX, edx+20(FP)
    25:		RET
    26:	
(dlv) bt
 0  0x000000000085cfea in github.com/klauspost/cpuid/v2.asmCpuidex
    at /home/user/go/pkg/mod/github.com/klauspost/cpuid/v2@v2.0.9/cpuid_amd64.s:21
 1  0x000000000085cdaa in github.com/klauspost/cpuid/v2.asmCpuidex
    at :0
 2  0x0000000000858525 in github.com/klauspost/cpuid/v2.(*CPUInfo).cacheSize
    at /home/user/go/pkg/mod/github.com/klauspost/cpuid/v2@v2.0.9/cpuid.go:765
 3  0x000000000085c396 in github.com/klauspost/cpuid/v2.addInfo
    at /home/user/go/pkg/mod/github.com/klauspost/cpuid/v2@v2.0.9/detect_x86.go:33
 4  0x0000000000856505 in github.com/klauspost/cpuid/v2.Detect
    at /home/user/go/pkg/mod/github.com/klauspost/cpuid/v2@v2.0.9/cpuid.go:257
 5  0x000000000085643c in github.com/klauspost/cpuid/v2.init.0
    at /home/user/go/pkg/mod/github.com/klauspost/cpuid/v2@v2.0.9/cpuid.go:236
 6  0x00000000004489c7 in runtime.doInit
    at /usr/lib/go/src/runtime/proc.go:6498
 7  0x0000000000448919 in runtime.doInit
    at /usr/lib/go/src/runtime/proc.go:6475
 8  0x0000000000448919 in runtime.doInit
    at /usr/lib/go/src/runtime/proc.go:6475
 9  0x0000000000448919 in runtime.doInit
    at /usr/lib/go/src/runtime/proc.go:6475
10  0x000000000043b754 in runtime.main
    at /usr/lib/go/src/runtime/proc.go:238
11  0x000000000046d041 in runtime.goexit
    at /usr/lib/go/src/runtime/asm_amd64.s:1581
(dlv) 

I believe it appears to be on line 21 because the CPUID instruction is executed (advancing the line counter) but never finishes, causing the hang. At least, that is how it seems.

Yes, the hung process can be killed.

However, a simple assembly program calling CPUID does not hang, which is pretty strange.
The assembly provided as a more in-depth example on the Wikipedia page for CPUID also returns correct values...

@DemiMarie

Yes, the hung process can be killed.

That seems to indicate that whatever is going on is not the process being stuck in Xen, since that wouldn’t be killable.

However, a simple assembly program calling CPUID does not hang, which is pretty strange.
The assembly provided as a more in-depth example on the Wikipedia page for CPUID also returns correct values...

Can you try manually calling CPUID with the exact same parameters as the failing call? This seems like a strange low-level bug in either Xen or Linux. Would you, by any chance, be able to see what happens in kernel mode when the CPUID instruction is issued?

@fosslinux (Author)

fosslinux commented Apr 5, 2022

Further investigation has revealed the following (not sure how I didn't pick up on some of this earlier):

This loop (the subleaf-enumeration loop in the library's cacheSize function, cpuid.go:765 in the backtrace above) never ends, creating the illusion that CPUID never ends (for some reason, whenever I Ctrl-C'd, it always stopped around CPUID, probably because those instructions are the most intensive).
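The enumeration pattern looks roughly like this (my own sketch in C rather than the library's actual Go code; the termination rule, cache type 0 in EAX bits 4:0 meaning "no more cache levels", follows the AMD cache-topology leaf definition):

/* Walk the subleaves of leaf 0x8000001d until the cache-type field
 * (EAX bits 4:0) reads 0, i.e. "no more cache levels".  Assumes the
 * leaf is implemented at all.  If the hypervisor returns the same
 * non-zero value for every subleaf, this loop never terminates. */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    for (unsigned int subleaf = 0; ; subleaf++) {
        __cpuid_count(0x8000001d, subleaf, eax, ebx, ecx, edx);
        if ((eax & 0x1f) == 0)   /* cache type 0: enumeration finished */
            break;
        printf("subleaf %u: type=%u level=%u\n",
               subleaf, eax & 0x1f, (eax >> 5) & 0x7);
    }
    return 0;
}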

On my AMD Ryzen 1600AF (basically a 2600), the cpuidex call (extended CPUID, which takes an extra subleaf argument) returns exactly the same values regardless of the value of the second argument (which is what the loop increments). I think this is incorrect, but I cannot find data to back that up either way.

On my Intel i5-7200U laptop, after a complete system upgrade, I cannot reproduce this issue. It seems to be limited to AMD systems? (I don't have any more AMD hardware to test on.)

In my second Qubes install for testing things (on a separate drive), I can reproduce this problem, but NOT if I run the syncthing binary in dom0 (casual readers: please do not do this; the security implication is possible complete system compromise).

When I next have some spare time to investigate this, I will see whether the same CPUID call in dom0 and in a qube produces the same output. (Hypothesis: it will not.)

@marmarek (Member)

marmarek commented Apr 5, 2022

On my AMD Ryzen 1600AF (basically a 2600), the cpuidex call (extended CPUID, which takes an extra subleaf argument) returns exactly the same values regardless of the value of the second argument (which is what the loop increments).

Can you say specifically what values are returned there?

@fosslinux (Author)

Can you say specifically what values are returned there?

Values returned:
	eax: 16673
	ebx: 29360191
	ecx: 63
	edx: 0
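Decoding that eax value (an illustrative sketch; the field layout follows the AMD cache-topology leaf): 16673 is 0x4121, so the cache-type field in bits 4:0 reads 1 (a data cache) and the level field in bits 7:5 reads 1. Since the type field never reads 0 for any subleaf, the "no more cache levels" terminator never appears and the enumeration loop spins forever.

/* Decode the reported EAX value: 16673 == 0x4121. */
#include <stdio.h>

int main(void)
{
    unsigned int eax = 16673;   /* value reported above */

    printf("eax = %#x, cache type = %u, cache level = %u\n",
           eax, eax & 0x1f, (eax >> 5) & 0x7);
    return 0;
}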

@andyhhp

andyhhp commented Apr 5, 2022

This is a bug in Xen, in how we derive CPUID data for guests. Leaf 0x8000001d shouldn't be exposed because it's not yet tied into the topology logic (the number of CPUs sharing this cache needs adjusting based on the number of vCPUs the VM has).

@fosslinux (Author)

Ah, makes sense, but how come this worked in the past then?

@andyhhp

andyhhp commented Apr 6, 2022

It will be a consequence of https://github.com/QubesOS/qubes-vmm-xen/blob/xen-4.14/patch-stable-0002-x86-cpuid-support-LFENCE-always-serialising-CPUID-bi.patch being backported for speculation reasons. Before that patch, 0x8000001c was reported as the max extended leaf, so userspace wouldn't have gone looking at 0x8000001d.
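For reference, the maximum extended leaf a guest currently sees can be checked with a quick query of leaf 0x80000000 (a small sketch for illustration, using GCC/Clang's cpuid.h):

/* Print the highest supported extended CPUID leaf, which leaf
 * 0x80000000 reports in EAX. */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    if (!__get_cpuid(0x80000000, &eax, &ebx, &ecx, &edx)) {
        puts("extended CPUID leaves not supported");
        return 1;
    }
    printf("max extended leaf: %#x\n", eax);
    return 0;
}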

@andyhhp

andyhhp commented Apr 6, 2022

Actually, there is also a bug in whatever piece of userspace is looping here. The AMD manual states that the contents of 0x8000001d and 0x8000001e are only valid if the TOPOEXT feature (CPUID.80000001.ecx[22]) is visible. Xen hides this feature, so userspace shouldn't go looking.
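In other words, well-behaved userspace would gate the cache-topology enumeration on something like this (a rough sketch, not any particular library's code; bit position per the AMD manual):

#include <cpuid.h>
#include <stdbool.h>
#include <stdio.h>

/* TOPOEXT lives in leaf 0x80000001, ECX bit 22. */
static bool has_topoext(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
        return false;
    return (ecx >> 22) & 1;
}

int main(void)
{
    if (!has_topoext()) {
        puts("TOPOEXT not advertised: leave leaves 0x8000001d/0x8000001e alone");
        return 0;
    }
    puts("TOPOEXT advertised: cache-topology leaves are valid to enumerate");
    return 0;
}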

@DemiMarie

Actually, there is also a bug in whatever piece of userspace is looping here. The AMD manual states that the contents of 0x8000001d and 0x8000001e are only valid if the TOPOEXT feature (CPUID.80000001.ecx[22]) is visible. Xen hides this feature, so userspace shouldn't go looking.

Time to be mean and inject #GP into the buggy guest userspace? (Only half-joking.)

@andyhhp

andyhhp commented Apr 6, 2022

@fosslinux Any chance I can have a name & email address for a Reported-by tag on the upstream fix?

@fosslinux (Author)

I don't use my real name (at least at this time), so if fosslinux <fosslinux@aussies.space> is OK, please use that.

andyhhp added a commit to andyhhp/xen that referenced this issue Apr 7, 2022
c/s 1a91425 increased the AMD max leaf from 0x8000001c to 0x80000021, but
did not adjust anything in the calculate_*_policy() chain.  As a result, on
hardware supporting these leaves, we read the real hardware values into the
raw policy, then copy into host, and all the way into the PV/HVM default
policies.

All 4 of these leaves have enable bits (first two by TopoExt, next by SEV,
next by PQOS), so any software following the rules is fine and will leave them
alone.  However, leaf 0x8000001d takes a subleaf input and at least two
userspace utilities have been observed to loop indefinitely under Xen (clearly
waiting for eax to report "no more cache levels").

Such userspace is buggy, but Xen's behaviour isn't great either.

In the short term, clobber all information in these leaves.  This is a giant
bodge, but there are complexities with implementing all of these leaves
properly.

Fixes: 1a91425 ("x86/cpuid: support LFENCE always serialising CPUID bit")
Link: QubesOS/qubes-issues#7392
Reported-by: fosslinux <fosslinux@aussies.space>
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
@emacsomancer

Is there any current workaround for this issue?

@DemiMarie

@marmarek time to pop out a new Xen package?

olafhering pushed a commit to olafhering/xen that referenced this issue Apr 28, 2022
c/s 1a91425 increased the AMD max leaf from 0x8000001c to 0x80000021, but
did not adjust anything in the calculate_*_policy() chain.  As a result, on
hardware supporting these leaves, we read the real hardware values into the
raw policy, then copy into host, and all the way into the PV/HVM default
policies.

All 4 of these leaves have enable bits (first two by TopoExt, next by SEV,
next by PQOS), so any software following the rules is fine and will leave them
alone.  However, leaf 0x8000001d takes a subleaf input and at least two
userspace utilities have been observed to loop indefinitely under Xen (clearly
waiting for eax to report "no more cache levels").

Such userspace is buggy, but Xen's behaviour isn't great either.

In the short term, clobber all information in these leaves.  This is a giant
bodge, but there are complexities with implementing all of these leaves
properly.

Fixes: 1a91425 ("x86/cpuid: support LFENCE always serialising CPUID bit")
Link: QubesOS/qubes-issues#7392
Reported-by: fosslinux <fosslinux@aussies.space>
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d4012d5
master date: 2022-04-07 11:36:45 +0100
@qubesos-bot

Automated announcement from builder-github

The package vmm-xen has been pushed to the r4.1 testing repository for the CentOS centos-stream8 template.
To test this update, please install it with the following command:

sudo yum update --enablerepo=qubes-vm-r4.1-current-testing

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The component vmm-xen (including package python3-xen-4.14.5-1.fc32) has been pushed to the r4.1 testing repository for dom0.
To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The package xen_4.14.5-1 has been pushed to the r4.1 testing repository for the Debian template.
To test this update, first enable the testing repository in /etc/apt/sources.list.d/qubes-*.list by uncommenting the line containing buster-testing (or appropriate equivalent for your template version), then use the standard update command:

sudo apt-get update && sudo apt-get dist-upgrade

Changes included in this update

@andrewdavidwong added the "diagnosed" label and removed the "needs diagnosis" label on May 4, 2022
@qubesos-bot

Automated announcement from builder-github

The package vmm-xen has been pushed to the r4.1 stable repository for the CentOS centos-stream8 template.
To install this update, please use the standard update command:

sudo yum update

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The package xen_4.14.5-4+deb10u1 has been pushed to the r4.1 stable repository for the Debian template.
To install this update, please use the standard update command:

sudo apt-get update && sudo apt-get dist-upgrade

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The component vmm-xen (including package python3-xen-4.14.5-4.fc32) has been pushed to the r4.1 stable repository for dom0.
To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update

@andrewdavidwong added the "affects-4.1" label on Aug 8, 2023