-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
activate_mapping: report explicit errors on failure #266
Conversation
Make sure to report the exact reason why the affinity cannot be enforced. No need to call fflush() since it is called internally in fclose(). Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2184735 Signed-off-by: Robin Jarry <rjarry@redhat.com>
} | ||
fclose(file); | ||
info->moved = 0; /*migration is done*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the code change, only non-error irq's info->moved is assigned 0. Previously all info->moved which reached here will be 0. I'm unsure of the correctness. The rest of the patch is good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. I would assume that not marking the IRQ as "moved" uppon error is correct. However, there may be side effects with doing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no record of the irq being moved, as the actual state of the irqs will be re-determined next time /proc/interrupts is parsed. The only question is, should we ban an irq just because we werent able to set its affinity. APIC's being out of space is a transient issue, banning an irq is permanent for the life of the process. This patch looks fine, but I'd just log the error as an administrative warning, and let irqbalance try again later by not calling add_banned_irq and remove_one_irq_from_db
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe I should do a separate commit that removes add_banned_irq()
and remove_one_irq_from_db()
then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be ok with that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But isn't that a direct revert of this commit? 55c5c32
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have adjusted the last commit to avoid banning the IRQ if fprintf or fclose return ENOSPC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, save for the log message indicating that affinity setting went wrong, which I think should be enough to inform an admin that there was a transient error, or that they need to take action to be more direct in their affinity setting policy. Or do you disagree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On aarch64, there are interrupts that are local to every CPU and that cannot be pinned elsewhere. Here is such an example on my raspberry pi 4B:
LibreELEC:~ # for i in 15 16 17 22 23 24 25 26 27; do echo $i $(basename /proc/irq/$i/*/) $(cat /proc/irq/$i/smp_affinity_list); done
15 DMA IRQ 0-3
16 DMA IRQ 0-3
17 DMA IRQ 0-3
22 DMA IRQ 0-3
23 DMA IRQ 0-3
24 arm-pmu 0
25 arm-pmu 1
26 arm-pmu 2
27 arm-pmu 3
LibreELEC:~ # echo 2 > /proc/irq/15/smp_affinity_list
LibreELEC:~ # echo 2 > /proc/irq/24/smp_affinity_list
sh: write error: Input/output error
Reading again through commit 55c5c32, it seems like the goal was to prevent such errors from happening repeatedly. But it had unfortunate side-effects on x86_64.
If smp_affinity cannot be enforced on a given IRQ because the kernel returns an ENOSPC error, do not ban it for the rest of the process' life. Only log the warning and let irqbalance try again later. If the kernel returns any other error (like EIO when trying to change affinity for PPI interrupts on aarch64), add it to ban list. Fixes: 55c5c32 ("arm64: Add irq aff change check For aarch64 ...") Signed-off-by: Robin Jarry <rjarry@redhat.com>
Once #268 is applied, I'll refresh this. |
I put everything in #269 |
Make sure to report the exact reason why the affinity cannot be enforced. No need to call fflush() since it is called internally in fclose().
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2184735
This is a follow up on #265