1.9.x hanging network adapter inside AARCH64 VM #252

Closed
peterneutron opened this issue Nov 21, 2022 · 9 comments

Comments

@peterneutron

Host: M1 Mac mini
Host OS: macOS 13.0.1 (22A400)
QEMU: 7.1.0

Guest: Arch Linux ARM (virtualized)
Kernel: 5.19.8
Network: Bridged (2 interfaces)

Affected irqbalance version: >= 1.9.0
Last working irqbalance version: <= 1.8.0

Summary: Every version of irqbalance >= 1.9.0 hangs one of the two interfaces at random, after an arbitrary amount of time.

Steps taken: Cross-checked with different combinations of QEMU and kernel versions; the issue still persists. Checked the service, systemd, network, and kernel logs but couldn't find any related entries.

I know this is a niche case and my ability to debug this is limited, but maybe someone is able to point me in the right direction.

@nhorman
Member

nhorman commented Nov 21, 2022

I would recommend not running irqbalance inside a VM. Normally it should be fine, but depending on how CPU pinning is configured between your host and your guest, it's possible that the guest affines an interrupt to a physical CPU that isn't mapped into the guest's CPUs at all, leading to a loss of softirq handling and the hang you describe. Just let the host handle interrupt affinity.
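To illustrate the kind of mismatch described above, here is a minimal Python sketch that compares each IRQ's affinity with the CPUs the guest reports as online. It assumes the standard Linux files `/proc/irq/<n>/smp_affinity_list` and `/sys/devices/system/cpu/online`; the function names are hypothetical. It only inspects the guest-visible side and cannot see how the host pins vCPUs, so an empty result does not rule out the problem.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0-3,6' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def irqs_outside_online(proc_irq=Path("/proc/irq"),
                        online_file=Path("/sys/devices/system/cpu/online")):
    """Return {irq: [cpus]} for IRQs whose affinity names a CPU that is not online.

    Hypothetical helper: the default paths are the usual Linux locations,
    but they are parameters so the function can be pointed at a test tree.
    """
    online = parse_cpu_list(online_file.read_text())
    suspicious = {}
    for entry in sorted(proc_irq.iterdir()):
        affinity_file = entry / "smp_affinity_list"
        if not entry.name.isdigit() or not affinity_file.exists():
            continue
        try:
            cpus = parse_cpu_list(affinity_file.read_text())
        except (OSError, ValueError):
            continue  # unreadable or unexpected content: skip this IRQ
        extra = cpus - online
        if extra:
            suspicious[int(entry.name)] = sorted(extra)
    return suspicious
```

Calling `irqs_outside_online()` with no arguments inside the guest lists any IRQ whose affinity includes a CPU the guest does not consider online.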

@peterneutron
Author

peterneutron commented Nov 21, 2022

Thanks for the advice. The base M1 has 8 cores in total, split into 4 high-performance and 4 energy-efficiency cores. QEMU exposes only the 4 high-performance cores. Is it still possible that the VM tries to access the other four? And why wouldn't it cause any problems in any version below 1.9.x?

@nhorman
Member

nhorman commented Nov 21, 2022

Honestly, I don't know; I've never looked at QEMU on macOS before. But I certainly can't diagnose a problem in irqbalance when the symptom doesn't occur in irqbalance. Long story short, all irqbalance does is write affinity values to the /proc/irq/ directories. If you stopped irqbalance and wrote the same affinity values that irqbalance is writing to those proc files, you should get the same hang behavior, i.e. this isn't going to be a problem I can fix.

As for why it only happens on versions from v1.9.0 onward, I suspect it is because of several fixes that went into the balancing algorithm. Prior to 1.9.0, that algorithm led to several irqs on various non-x86 arches never getting selected for rebalancing.
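For anyone who wants to try writing affinity values by hand as described above, here is a minimal Python sketch. It assumes the standard `/proc/irq/<n>/smp_affinity` file, which takes a hexadecimal CPU bitmask in comma-separated 32-bit groups; the function names and the IRQ number in the usage note are hypothetical, and writing to the real file requires root.

```python
from pathlib import Path


def cpus_to_mask(cpus):
    """Convert CPU indices to the comma-grouped hex bitmask used by smp_affinity."""
    value = 0
    for cpu in cpus:
        value |= 1 << cpu
    digits = f"{value:x}"
    # Split the hex string into groups of 8 digits (32 bits), right to left.
    groups = []
    while digits:
        groups.insert(0, digits[-8:])
        digits = digits[:-8]
    return ",".join(groups)


def set_irq_affinity(irq, cpus, proc_irq=Path("/proc/irq")):
    """Write a CPU mask to <proc_irq>/<irq>/smp_affinity and return the mask.

    Hypothetical helper: proc_irq is a parameter so the function can be
    exercised against a scratch directory instead of the live system.
    """
    mask = cpus_to_mask(cpus)
    (proc_irq / str(irq) / "smp_affinity").write_text(mask + "\n")
    return mask
```

For example, `set_irq_affinity(42, [0, 2])` would write `5` to `/proc/irq/42/smp_affinity`, where 42 stands in for whichever IRQ belongs to the affected adapter.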

@peterneutron
Author

Thanks for taking the time to look at this. I will close this now and will report back if anything comes up on my end.

@peterneutron
Author

@nhorman I found the commit that causes this issue on my setup, and to my surprise it seems to have nothing to do with the AARCH64-related commits since 1.8.0.

Commit: 2a66a66

What I did was essentially build irqbalance starting from 1.8.0, adding every commit one by one up to 1.9.2. At the moment I'm running 1.9.2 with only this commit removed, which seems to have fixed the hanging network adapters. Can you make any sense of this?

@nhorman
Member

nhorman commented Nov 22, 2022

That has everything to do with AARCH64, in that the code you are referring to affects all arches. As I noted above, prior to that change, several irqs were never getting selected for rebalancing, which hid the hang from you.

@peterneutron
Author

Then why doesn't it happen when I disable irqbalance and manually change affinity via /proc/irq?

@nhorman
Member

nhorman commented Nov 22, 2022

I don't know, @peterneutron, but it's not something I'm going to be able to help you with. Irqbalance's interface to the kernel is exactly the same as the one you are writing to manually. There may be a timing issue at play that triggers the hang, and you can use irqbalance to reproduce it, but if your system hangs as a result of whatever that magic order of operations is, the root cause cannot be irqbalance. If the adapter in question stops responding to interrupts, you're going to need to instrument the kernel driver (or write a systemtap script) to figure out what's going on.
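Short of instrumenting the driver, a rough first check is whether the adapter's interrupt counters still advance. The Python sketch below is one way to do that, assuming the usual `/proc/interrupts` layout (a header row of CPU columns, then one row per interrupt); the function names and the `virtio` device names in the usage note are hypothetical. A counter that stays flat is only a hint, since an idle interface also raises no interrupts.

```python
def parse_interrupts(text):
    """Return {label: (total_count, description)} from /proc/interrupts text."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: one column per CPU
    totals = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        counts = []
        # Per-CPU counters come first; some rows carry fewer than ncpus.
        for field in fields[:ncpus]:
            if not field.isdigit():
                break
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        totals[label.strip()] = (sum(counts), description)
    return totals


def stalled_irqs(before, after, name_filter=""):
    """List labels whose total count is unchanged between two snapshots."""
    return [
        label
        for label, (count, desc) in after.items()
        if name_filter in desc and label in before and before[label][0] == count
    ]
```

A possible use is to read `/proc/interrupts` twice, a few seconds apart and with traffic flowing on both interfaces, then call `stalled_irqs(parse_interrupts(first), parse_interrupts(second), "virtio")` to see which matching interrupts did not fire in between.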

@peterneutron
Author

Thank you again. I will leave it at that, because instrumenting the kernel driver or writing a systemtap script is way out of my league.
