irqbalance not working on pa-risc? #159
Comments
I see how this would fix both problems, and I'm ok with the EOVERFLOW fix, but I'm a bit lost on the sysfs fix. By all rights the online attribute should exist for parisc cpus (from what I can see in the kernel code, that attribute is arch agnostic), and so should be there. Instead of papering over the problem to avoid using it, could you please look into why parisc systems don't present that file? If you want to open a PR for the EOVERFLOW issue, I'll gladly pull that |
I noticed that on the parisc system I have access to there is actually only one CPU available:
despite the kernel being SMP:
and the system apparently having multiple CPUs:
This could be the reason why the |
You know what, it probably is. Given that sysfs shows 8 cpus but you only have one physical cpu, that's likely the result of sysfs getting populated based on the kernel's cpu_possible mask (generated from the NR_CPUS configuration variable, I think). And they don't list as online because they don't actually exist. As for cpu0, it's the boot processor, so it always has to be online. Given that, I think the right thing to do here is:
I don't think we should rely on hotplug/state, as there is no guarantee that file will always be there either (i.e. if it's not configured into the kernel). Better to just assume cpuN is offline if the online attribute doesn't exist (where N != 0) |
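A minimal sketch of the rule proposed above, assuming a sysfs-like directory layout. The function name `cpu_is_online` and the `root` parameter are hypothetical and only here for illustration; this is not irqbalance code.

```python
from pathlib import Path


def cpu_is_online(cpu: int, root: str = "/sys/devices/system/cpu") -> bool:
    """cpu0 is always online; cpuN (N != 0) without an 'online' file is offline."""
    if cpu == 0:
        # The boot processor is assumed to be online unconditionally.
        return True
    online = Path(root) / f"cpu{cpu}" / "online"
    try:
        # The attribute holds "1" for an online cpu and "0" for an offline one.
        return online.read_text().strip() == "1"
    except OSError:
        # Attribute missing or unreadable: treat the cpu as offline.
        return False
```

Note that on the 4-way parisc machine described further down, where no cpu has an online attribute, this rule would report cpu1 to cpu3 as offline.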
On 26.08.20 15:26, Neil Horman wrote:
1. Assume that if /sys/devices/system/cpu/cpu/online doesn't exist, we should treat the cpu as unavailable for balancing
I just checked on a 4-way SMP machine
root@phantom:~# ls -la /sys/devices/system/cpu/cpu3/
hotplug/ subsystem/ topology/ uevent
That means */online doesn't exist for any online CPU.
Helge
|
I'm sorry, can you clarify? I presume this is a parisc system you are looking at? If the system is truly a 4-way smp system, it would seem that cpu1, cpu2 and cpu3 should have the online attribute (cpu0 being online by default, since it's the bsp). If the attribute doesn't exist and the system is truly smp, it seems that the sysfs attributes in the kernel here have a bug |
On 26.08.20 17:26, Neil Horman wrote:
I'm sorry, can you clarify? I presume this is a parisc system you are looking at? If the system is truly a 4-way smp system,
Yes, it's a real 4-way SMP parisc machine.
I've given Paride Legovini access to it as well, so I assume he will follow up soon too.
it would seem that cpu1 cpu2 and cpu3 should have the online attribute (cpu0 being default online, since its the bsp).
If the attribute doesn't exist and the system is truly smp, it seems that the sysfs attributes in the kernel here have a bug
I haven't checked the kernel yet, but I don't think the attributes/code in the kernel has a bug.
Many of the sysfs entries depend on the kernel configuration, e.g. if CPU_HOTPLUG is enabled.
For parisc, for example, we haven't yet implemented CPU hotplug, which could explain why the online entry isn't visible.
If you check my patch, you can see that I implemented the reading of "/hotplug/state" only as fallback option in case the "/online" file isn't available.
That should IMHO be safe for any kernels.
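The fallback order described here could be sketched roughly as follows. The patch itself is not shown in this thread, so the function names are hypothetical, and treating a hotplug state of 0 as offline and any other value as online is an assumption.

```python
from pathlib import Path
from typing import Optional


def read_int(path: Path) -> Optional[int]:
    """Return the integer stored in a file, or None if it is missing or malformed."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def cpu_online_with_fallback(cpu_dir: str) -> Optional[bool]:
    """Check 'online' first and fall back to 'hotplug/state' only if it is absent."""
    base = Path(cpu_dir)
    online = read_int(base / "online")
    if online is not None:
        return online == 1
    state = read_int(base / "hotplug" / "state")
    if state is not None:
        # Assumption: 0 means offline, every other state counts as online.
        return state != 0
    # Neither file exists: the caller has to decide what that means.
    return None
```

Returning None when both files are missing keeps the "unknown" case visible instead of silently picking a default.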
Helge
|
I understand what you're saying, but that really doesn't give me additional confidence regarding your fix:
I really think (2) is what we need to figure out here. By all rights that attribute should be there, and it isn't. Either that or we need a solid explanation of why it doesn't exist |
Even on x86_64, e.g. kernel 4.19 the "online" entry isn't there:
$ uname -a
Linux ls3530 4.19.134-300.fc30.x86_64 #1 SMP Wed Jul 22 16:10:41 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Fedora release 30 (Thirty)
$ ls -la /sys/devices/system/cpu/cpu0/online
ls: cannot access '/sys/devices/system/cpu/cpu0/online': No such file or directory
$ ls -la /sys/devices/system/cpu/cpu0/hotplug/state
-r--r--r--. 1 root root 4096 26. Aug 18:41 /sys/devices/system/cpu/cpu0/hotplug/state
$ ls -la /sys/devices/system/cpu/cpu0
drwxr-xr-x. 10 root root 0 25. Aug 00:22 .
drwxr-xr-x. 14 root root 0 25. Aug 00:22 ..
drwxr-xr-x. 2 root root 0 26. Aug 18:39 acpi_cppc
drwxr-xr-x. 7 root root 0 25. Aug 00:23 cache
lrwxrwxrwx. 1 root root 0 25. Aug 00:24 cpufreq -> ../cpufreq/policy0
drwxr-xr-x. 11 root root 0 26. Aug 18:39 cpuidle
-r--------. 1 root root 4096 26. Aug 18:39 crash_notes
-r--------. 1 root root 4096 26. Aug 18:39 crash_notes_size
lrwxrwxrwx. 1 root root 0 26. Aug 18:39 driver -> ../../../../bus/cpu/drivers/processor
lrwxrwxrwx. 1 root root 0 26. Aug 18:39 firmware_node -> ../../../LNXSYSTM:00/LNXCPU:00
drwxr-xr-x. 2 root root 0 26. Aug 18:39 hotplug
drwxr-xr-x. 2 root root 0 26. Aug 18:39 microcode
lrwxrwxrwx. 1 root root 0 26. Aug 18:39 node0 -> ../../node/node0
drwxr-xr-x. 2 root root 0 26. Aug 18:39 power
lrwxrwxrwx. 1 root root 0 25. Aug 00:22 subsystem -> ../../../../bus/cpu
drwxr-xr-x. 2 root root 0 26. Aug 18:39 thermal_throttle
drwxr-xr-x. 2 root root 0 25. Aug 00:23 topology
-rw-r--r--. 1 root root 4096 26. Aug 18:39 uevent
On parisc:
# uname -a
Linux phantom.physik.fu-berlin.de 5.7.0-3-parisc64 #1 SMP Debian 5.7.17-1 (2020-08-25) parisc64 GNU/Linux
# ls -la /sys/devices/system/cpu/cpu0
drwxr-xr-x 4 root root 0 26. Aug 12:32 .
drwxr-xr-x 12 root root 0 26. Aug 12:29 ..
drwxr-xr-x 2 root root 0 26. Aug 2020 hotplug
lrwxrwxrwx 1 root root 0 26. Aug 2020 subsystem -> ../../../../bus/cpu
drwxr-xr-x 2 root root 0 26. Aug 2020 topology
-rw-r--r-- 1 root root 4096 26. Aug 2020 uevent
Still checking...
|
Yes, as I noted here, it would seem the bsp never gets an online attribute (possibly a bug, possibly intentional, as the bsp should never be taken offline). The other non-bsp cpus will have online attributes, however, on x86 systems (or arm/power systems, as far as I'm able to tell). It's just parisc that doesn't, which seems wrong to me. |
In drivers/base/cpu.c:register_cpu()
the value of "cpu->hotpluggable" is stored to cpu->dev.offline_disabled,
which then is checked in drivers/base/core.c:
device_supports_offline(dev) && !dev->offline_disabled
and depending on it the "online" entry seems to be created.
This explains why CPU0 never has the "online" entry, while
other CPUs on x86 have that file.
Since we don't yet support CPU hotplugging on parisc, I can not
set cpu->hotpluggable=1 yet. Same seems to apply to other
architectures which don't support hotplug in general.
What's your suggestion?
|
If you boot with "cpu0_hotplug=y" or "on" you should see the online
parameter.
In arch/x86/kernel/topology.c, the compile option
CONFIG_BOOTPARAM_HOTPLUG_CPU0 can set cpu0_hotpluggable to 1. I think that
triggers the online attribute later to be displayed or not.
|
I think @ppwaskie's suggestion is a good one, at least to confirm that we now understand why and when the online attribute becomes present. As for the default state of a cpu, I still think we need to find a way to make some assumptions about what the lack of an online attribute means for cpu presence (i.e. for non-bsp cpus, does the lack of an online attribute imply no presence?). Alternatively, we have to find a way to definitively determine whether a cpu is present or not (the topology directory is present on both arches; can you check to see if core_id is set to -1 or some such for cpus that aren't present?) |
The machine has two Dual-Core CPUs (2 Cores, with 2 CPUs each):
root@phantom:/sys/devices/system/cpu/cpu0/topology# grep . *
core_cpus:01
core_cpus_list:0
core_id:0
core_siblings:03
core_siblings_list:0-1
die_cpus:01
die_cpus_list:0
die_id:-1
package_cpus:03
package_cpus_list:0-1
physical_package_id:0
thread_siblings:01
thread_siblings_list:0
root@phantom:/sys/devices/system/cpu/cpu2/topology# grep . *
core_cpus:04
core_cpus_list:2
core_id:0
core_siblings:0c
core_siblings_list:2-3
die_cpus:04
die_cpus_list:2
die_id:-1
package_cpus:0c
package_cpus_list:2-3
physical_package_id:1
thread_siblings:04
thread_siblings_list:2
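For reading the hex masks in the output above, here is a small sketch that turns a sysfs cpumask string into cpu numbers. The helper name `cpumask_to_cpus` is hypothetical. It assumes the usual sysfs layout in which wider masks are printed as comma-separated, zero-padded 32-bit hex groups.

```python
from typing import List


def cpumask_to_cpus(mask: str) -> List[int]:
    """Convert a hex cpumask such as '0c' into the list of cpu numbers it names."""
    # Dropping the commas joins the 32-bit groups into one large hex number.
    value = int(mask.strip().replace(",", ""), 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]
```

With the values shown above, `cpumask_to_cpus("03")` gives cpus 0 and 1, and `cpumask_to_cpus("0c")` gives cpus 2 and 3, which matches the `core_siblings_list` lines.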
|
sure you can email me at nhorman@tuxdriver.com |
I'm assuming this is the PARISC system?
I have an older, non-Xeon x86-64 box that doesn't support CPU hotplug
(i7-4770k) that shows the "online" attribute for all my CPUs, except for
CPU0. I am not surprised by that, since as Neil said a few times, the BSP
is assumed here to always be online.
The Xeon I have has online in CPU0 only after I enable CPU0 hotplug
support, both through a kernel rebuild, or by using the kernel boot
parameter to enable it.
-PJ
|
On the 4-core parisc system we have for
while on the single core one we have:
So a non-empty |
So a non-empty |core_id| does seem a good indicator of presence. And it looks like a non-empty |core_id| together with a missing |online| attribute indicates an online cpu.
Just bringing in my patch above again... It's an indicator as well:
root@phantom:/sys/devices/system/cpu# grep . /sys/devices/system/cpu/cpu*/hotplug/state
/sys/devices/system/cpu/cpu0/hotplug/state:210
/sys/devices/system/cpu/cpu1/hotplug/state:210
/sys/devices/system/cpu/cpu2/hotplug/state:210
/sys/devices/system/cpu/cpu3/hotplug/state:210
/sys/devices/system/cpu/cpu4/hotplug/state:0
/sys/devices/system/cpu/cpu5/hotplug/state:0
/sys/devices/system/cpu/cpu6/hotplug/state:0
/sys/devices/system/cpu/cpu7/hotplug/state:0
|
Except on my Fedora 32 x86_64 system, I don't have a state attribute for any of my cpus, so that seems less than reliable as well |
but |
Except on my Fedora 32 x86_64 system, I don't have a state attribute for any of my cpus, so that seems less than reliable as well
Yes, but in that case you have the "online" attribute which could have been checked before the "state" attribute.
Based on the current findings and given the variety of CONFIG options in the kernel, I think you need a priority chain: test X first; if it doesn't exist, fall back to Y; and if Y doesn't exist, try Z.
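As a rough illustration of such a priority chain, assuming each test is a function that returns True or False when it can decide and None when its source doesn't exist. The names `Probe` and `first_answer` are hypothetical.

```python
from typing import Callable, Iterable, Optional

# A probe inspects one cpu directory and answers True/False, or None if it cannot tell.
Probe = Callable[[str], Optional[bool]]


def first_answer(probes: Iterable[Probe], cpu_dir: str, default: bool = False) -> bool:
    """Ask each probe in priority order and return the first definite answer."""
    for probe in probes:
        answer = probe(cpu_dir)
        if answer is not None:
            return answer
    # No probe could decide, so fall back to the caller's default.
    return default
```

X, Y and Z would then be three small probes (for example one per sysfs file) passed in the order they should be trusted.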
Anyway, finally I'm fine with any solution you prefer (as long as it works).
Helge
|
I'd really like to avoid that if I could, just to keep the code simple. In that vein, I just noticed something. On my x86_64 system, there is a sysfs file /sys/devices/system/cpu/online. It offers an inclusion list in the format N-M, indicating which processors are online. Can you check your parisc system to see if it exists there as well? If so, perhaps that is a canonical way to determine which cpus are online across arches and kernel configs. |
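For reference, a minimal sketch of parsing that list format. The name `parse_cpulist` is hypothetical. It handles single numbers and N-M ranges separated by commas, and does not attempt anything beyond that.

```python
from typing import List


def parse_cpulist(text: str) -> List[int]:
    """Expand a list such as '0-3' or '0,2-3' into individual cpu numbers."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty input or stray comma
        low, _, high = part.partition("-")
        first = int(low)
        last = int(high) if high else first  # a bare number is a one-cpu range
        cpus.extend(range(first, last + 1))
    return cpus
```

So `parse_cpulist("0-3")` yields cpus 0 through 3 and `parse_cpulist("0")` yields just cpu 0.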
On my x86_64 system with 8 cores (threads):
On a single-core parisc machine:
On a 4-core parisc machine:
https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-devices-system-cpu says that |
On a PPC64 machine with SMT disabled:
Tested also on arm64 and s390x, it's always consistent. |
Ok, that's excellent news. I think we have parsing code for that format of file as well, and it sounds like that specific online file is cross arch and cross config. I'll write up a patch today |
#159 recently brought to our attention that online cpu status isn't functional on all arches. Specifically on parisc, the availability of /sys/devices/system/cpu/cpu<N>/online is in question. The implication here is that it's not feasible to accurately determine cpu count, and as a result, irqbalance doesn't work on that arch. Fix it by changing our online detection strategy. The file /sys/devices/system/cpu/online is a cpulist format file that seems to be present across all arches and configs. As such, we can use this file to determine online status per cpu reliably. Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
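The strategy in this commit message can be sketched as below. This is not the actual irqbalance change, which is written in C. The helper names and the `root` parameter are hypothetical, and the file is assumed to contain only numbers and N-M ranges separated by commas.

```python
import os


def cpulist_to_mask(text: str) -> int:
    """Turn a cpulist such as '0-3' into a bitmask with one bit per listed cpu."""
    mask = 0
    for part in text.strip().split(","):
        if not part:
            continue
        low, _, high = part.partition("-")
        first = int(low)
        last = int(high) if high else first
        for cpu in range(first, last + 1):
            mask |= 1 << cpu
    return mask


def read_online_mask(root: str = "/sys/devices/system/cpu") -> int:
    """Read <root>/online once and return the online cpus as a bitmask."""
    with open(os.path.join(root, "online")) as handle:
        return cpulist_to_mask(handle.read())


def mask_has_cpu(mask: int, cpu: int) -> bool:
    """Per-cpu online check against the mask read above."""
    return bool((mask >> cpu) & 1)
```

Reading the single file once and testing bits afterwards avoids touching any per-cpu attribute, which is the point of the change.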
https://github.com/Irqbalance/irqbalance/tree/cpuonline Everyone give that a shot, and let me know if it works for you. It seems to work well on my x86_64 system |
This is what I get on the 4-way PARISC box. root@phantom:/home/deller/git/irqbalance# ./irqbalance -f -d -o |
Can you attach the entire log? We want to see if we get 4 cpus that we can balance to |
Actually, scratch that, the Cache domain dump seems to have correctly shown 4 unique cpu masks, so I think this is working. I'll merge it shortly |
Fixed as per #163 |
A bit late as the PR landed already, but here is the output on a POWER9 machine with SMP disabled. This is a bug I'm really glad I reported, thanks to both of you!
|
Sorry, I should have waited, but the power run looks correct to me, as does my local run on x86_64 |
Hi, maintainer of the irqbalance Debian package here. The package currently ships irqbalance 1.6.0 with the following patch applied (not written by me):
I am trying to understand whether the issues are still present in v1.7.0 and, if so, whether the patch should be picked up in this git repo. IIUC there are two issues:
1. Files such as /sys/devices/system/cpu/cpu0/online are missing on PA-RISC systems. This indeed seems to be the case. Should irqbalance fall back to /sys/devices/system/cpu/cpu0/hotplug/state in this case, as the patch does?
2. EOVERFLOW when echoing 0xfffffff to /proc/irq/100/smp_affinity. This doesn't seem strictly related to pa-risc, but the problem doesn't happen on my x86_64 system. Perhaps this is fixed already?
I am not root on the pa-risc system I mentioned, so I can't fully test irqbalance there. Let me know if you need any other bit of information. Thanks!