Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Affinity mapping fixes #269

Merged
merged 4 commits into from
Jul 13, 2023
Merged

Affinity mapping fixes #269

merged 4 commits into from
Jul 13, 2023

Conversation

rjarry
Copy link
Contributor

@rjarry rjarry commented Jul 12, 2023

Hi,

This series is a follow up on #265 #266 #267 and #268 and addresses this issue: https://bugzilla.redhat.com/show_bug.cgi?id=2184735

The first commit fixes a potential use-after-free error introduced by commit 55c5c32

The second and third patches aim at better error reporting when an IRQ affinity cannot be set.

The last patch makes sure to only consider that an IRQ affinity cannot be managed on specific cases.

I have done some manual testing and everything looks in order.

@rjarry rjarry changed the title Blacklist Affinity mapping fixes Jul 12, 2023
@nhorman
Copy link
Member

nhorman commented Jul 12, 2023

I think this looks pretty reasonable, will test in the am. Can I ask what testing you have done on it?

@rjarry
Copy link
Contributor Author

rjarry commented Jul 12, 2023

Hi Neil,

Here is a dump of /proc/interrupts and a excerpt of a test log.

https://gist.github.com/rjarry/c465c00d151397da3f9b9e64bd174723

NB: I have set IRQBALANCE_BANNED_CPUS=000000ff,ffcffffc

add_banned_irq appends the irq_info to the banned_irqs list.
remove_one_irq_from_db removes it from the interrupts_db and free()s it.

This leaves an invalid pointer dangling in banned_irqs *and* potentially
in rebalance_irq_list which can cause use-after-free errors.

Do not move the irq_info around. Only add a flag to indicate that this
irq's affinity cannot be managed and ignore the irq when this flag is
set.

Link: Irqbalance#267
Fixes: 55c5c32 ("arm64: Add irq aff change check For aarch64...")
Signed-off-by: Robin Jarry <rjarry@redhat.com>
fprintf() is buffered and may not report an error which may be deferred
when fflush() is called (either explicitly or internally by fclose()).

Check for errors returned by fopen(), fprintf() and fclose() and add
IRQ_FLAG_AFFINITY_UNMANAGED accordingly.

Fixes: 55c5c32 ("arm64: Add irq aff change check For aarch64, ...")
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2184735
Signed-off-by: Robin Jarry <rjarry@redhat.com>
@nhorman
Copy link
Member

nhorman commented Jul 13, 2023

Interesting that several of your irqs report an ENOENT error:
Cannot change IRQ 862 affinity: No such file or directory
I would have thought that the smp_affinity file would still exist and just not be writable, but thats a kernel behavior, so it is what it is.

I did notice one issue in your code (comment marked in the commit), other than that, my (admittedly limited testing here shows it works). Thank you for that. If you fix up the errno issue, I'll merge this

@rjarry
Copy link
Contributor Author

rjarry commented Jul 13, 2023

Hi, thanks for the review and testing. I missed your comment. On what commit did you leave it?

@rjarry
Copy link
Contributor Author

rjarry commented Jul 13, 2023

And yes, I was also surprised about the ENOENT errors. Maybe these IRQs are leftovers not properly cleaned up in procfs by the kernel when adding/removing VFs. The folders do exist but seem to be empty.

@nhorman
Copy link
Member

nhorman commented Jul 13, 2023

Its listed immediately above your last comment.

You're guess could very well be correct, if irq registration is happening while we're processing procfs we may well get that error, in either case I would imagine that your handling of it is correct.

@rjarry
Copy link
Contributor Author

rjarry commented Jul 13, 2023

Hmmm, I am very confused. I don't see your comment...

image

@nhorman
Copy link
Member

nhorman commented Jul 13, 2023

@rjarry
Copy link
Contributor Author

rjarry commented Jul 13, 2023

Did you validate/submit your review?

this is what I see:

image

activate.c Outdated Show resolved Hide resolved
@nhorman
Copy link
Member

nhorman commented Jul 13, 2023

I did now :)

If a given IRQ affinity cannot be set, include strerror in the warning
message.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2184735
Signed-off-by: Robin Jarry <rjarry@redhat.com>
Some errors reported when writing to smp_affinity are transient. For
example, when a CPU interrupt controller does not have enough room to
map the IRQ, the kernel will return "No space left on device".

This kind of situation can change over time. Do not mark the IRQ
affinity as "unmanaged". Let irqbalance try again later.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
@rjarry
Copy link
Contributor Author

rjarry commented Jul 13, 2023

Should be good now. I also have fixed a typo in a comment.

@nhorman nhorman merged commit 8b23bda into Irqbalance:master Jul 13, 2023
@rjarry rjarry deleted the blacklist branch July 13, 2023 15:31
@rjarry
Copy link
Contributor Author

rjarry commented Jul 13, 2023

Thanks 👍

}
fclose(file);
if (ret < 0)
goto error;
info->moved = 0; /*migration is done*/
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argl I forgot to return after this line... I'll send another patch to fix this.

@hnyman
Copy link

hnyman commented Dec 9, 2023

This seems to cause log spam for some ARM 7 based routers in OpenWrt.
Specifically the EINVAL error for some IRQs seems to repeatedly surface for some devices.

Log and strace output for IRQ 48.

Sat Dec  9 10:17:27 2023 daemon.warn irqbalance: Cannot change IRQ 53 affinity: Invalid argument
Sat Dec  9 10:17:27 2023 daemon.warn irqbalance: Cannot change IRQ 24 affinity: I/O error
Sat Dec  9 10:17:27 2023 daemon.warn irqbalance: IRQ 24 affinity is now unmanaged
Sat Dec  9 10:17:27 2023 daemon.warn irqbalance: Cannot change IRQ 48 affinity: Invalid argument


open("/proc/irq/48/smp_affinity", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 6
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb6ed9000
ioctl(6, TIOCGWINSZ, 0xbeb134e8)        = -1 ENOTTY (Not a tty)
writev(6, [{iov_base="00000001", iov_len=8}, {iov_base=NULL, iov_len=0}], 2) = -1 EINVAL (Invalid argument)
close(6)                                = 0
munmap(0xb6ed9000, 4096)                = 0
clock_gettime64(CLOCK_REALTIME, {tv_sec=1702109847, tv_nsec=340549081}) = 0
sendto(5, "<28>Dec  9 08:17:27 irqbalance: "..., 80, 0, NULL, 0) = 80
writev(1, [{iov_base="Cannot change IRQ 48 affinity: I"..., iov_len=47}, {iov_base="\n", iov_len=1}], 2Cannot change IRQ 48 affinity: Invalid argument
) = 48

Removing EINVAL from the temporary error list in activate.c stops the log spam after one error:

Sat Dec  9 10:33:24 2023 daemon.warn /usr/sbin/irqbalance: Cannot change IRQ 48 affinity: Invalid argument
Sat Dec  9 10:33:24 2023 daemon.warn /usr/sbin/irqbalance: IRQ 48 affinity is now unmanaged

(Curious that it fails in this kernel 5.15 based Linux, but seems to work for the same device with a bit newer 6.1 build.)

This change is probably not a bug from irqbalance perspective, but just brings an already existing assignation problem into surface.

Not quite sure what to make out of this:

root@router1:~# cat /proc/irq/48/smp_affinity_list
0-1
root@router1:~# cat /proc/irq/48/smp_affinity
3
root@router1:~# cat /proc/irq/48/effective_affinity_list

root@router1:~# cat /proc/irq/48/effective_affinity
0
root@router1:~# cat /proc/irq/48/affinity_hint
0
root@router1:~# ls /proc/irq/48/
affinity_hint            effective_affinity_list  node                     smp_affinity_list
effective_affinity       keys                     smp_affinity             spurious

@rjarry
Copy link
Contributor Author

rjarry commented Dec 15, 2023

Hi there,

(Curious that it fails in this kernel 5.15 based Linux, but seems to work for the same device with a bit newer 6.1 build.)

This makes me think that it is a driver issue that has been fixed somehow but I could be wrong. In any case, assuming that EINVAL is not a transient issue may not be a good idea since this error is also triggered when trying to completely disable the IRQ:

https://elixir.bootlin.com/linux/v5.15/source/kernel/irq/proc.c#L167

I am not sure why you have that issue with 5.15 and not with 6.1 as this code has not changed in a long time.

@hnyman
Copy link

hnyman commented Dec 15, 2023

why you have that issue with 5.15 and not with 6.1

That was actually a wrong statement from me. The behaviour was actually the same.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants