Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

activate_mapping: report explicit errors on failure #266

Closed
wants to merge 2 commits into from

Conversation

rjarry
Copy link
Contributor

@rjarry rjarry commented Jul 5, 2023

Make sure to report the exact reason why the affinity cannot be enforced. No need to call fflush() since it is called internally in fclose().

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2184735

This is a follow up on #265

Make sure to report the exact reason why the affinity cannot be
enforced. No need to call fflush() since it is called internally in
fclose().

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2184735
Signed-off-by: Robin Jarry <rjarry@redhat.com>
}
fclose(file);
info->moved = 0; /*migration is done*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the code change, only non-error irq's info->moved is assigned 0. Previously all info->moved which reached here will be 0. I'm unsure of the correctness. The rest of the patch is good to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. I would assume that not marking the IRQ as "moved" uppon error is correct. However, there may be side effects with doing this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no record of the irq being moved, as the actual state of the irqs will be re-determined next time /proc/interrupts is parsed. The only question is, should we ban an irq just because we werent able to set its affinity. APIC's being out of space is a transient issue, banning an irq is permanent for the life of the process. This patch looks fine, but I'd just log the error as an administrative warning, and let irqbalance try again later by not calling add_banned_irq and remove_one_irq_from_db

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I should do a separate commit that removes add_banned_irq() and remove_one_irq_from_db() then?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be ok with that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But isn't that a direct revert of this commit? 55c5c32

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have adjusted the last commit to avoid banning the IRQ if fprintf or fclose return ENOSPC.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, save for the log message indicating that affinity setting went wrong, which I think should be enough to inform an admin that there was a transient error, or that they need to take action to be more direct in their affinity setting policy. Or do you disagree?

Copy link
Contributor Author

@rjarry rjarry Jul 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On aarch64, there are interrupts that are local to every CPU and that cannot be pinned elsewhere. Here is such an example on my raspberry pi 4B:

LibreELEC:~ # for i in 15 16 17 22 23 24 25 26 27; do echo $i $(basename /proc/irq/$i/*/) $(cat /proc/irq/$i/smp_affinity_list); done
15 DMA IRQ 0-3
16 DMA IRQ 0-3
17 DMA IRQ 0-3
22 DMA IRQ 0-3
23 DMA IRQ 0-3
24 arm-pmu 0
25 arm-pmu 1
26 arm-pmu 2
27 arm-pmu 3
LibreELEC:~ # echo 2 > /proc/irq/15/smp_affinity_list 
LibreELEC:~ # echo 2 > /proc/irq/24/smp_affinity_list 
sh: write error: Input/output error

Reading again through commit 55c5c32, it seems like the goal was to prevent such errors from happening repeatedly. But it had unfortunate side-effects on x86_64.

If smp_affinity cannot be enforced on a given IRQ because the kernel
returns an ENOSPC error, do not ban it for the rest of the process'
life.

Only log the warning and let irqbalance try again later.

If the kernel returns any other error (like EIO when trying to change
affinity for PPI interrupts on aarch64), add it to ban list.

Fixes: 55c5c32 ("arm64: Add irq aff change check For aarch64 ...")
Signed-off-by: Robin Jarry <rjarry@redhat.com>
@rjarry
Copy link
Contributor Author

rjarry commented Jul 10, 2023

Once #268 is applied, I'll refresh this.

@rjarry rjarry mentioned this pull request Jul 12, 2023
@rjarry
Copy link
Contributor Author

rjarry commented Jul 12, 2023

I put everything in #269

@rjarry rjarry closed this Jul 12, 2023
@rjarry rjarry deleted the report-errors branch July 12, 2023 19:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants