atlantic stops receiving packets after a while under high load #483

@hoshinolina

This issue has been happening randomly for a while now, but I only just got around to setting up an alternate network interface to investigate.

After some time under high load/IO, the 10GbE interface on a j274 stops receiving network packets. TX still works, although outgoing traffic eventually devolves to just ARP requests (which other hosts do receive). The failure is not instant: it seems to develop over a period of up to several minutes, during which network connectivity is degraded only between certain hosts.

Image

In this instance, the failure happened at 20:02 initially and most traffic flows were interrupted, but the (remote) monitoring still worked. It somehow recovered at 20:17, before failing completely at 20:31.

This is currently only happening on a j274. I don't think I've ever seen it on other Mac Mini variants (M1 Pro, M2). I'm not sure if that's just a coincidence, though (it could be related to CPU performance relative to network IO, for example).

I have a vague feeling this might be IRQ related. If there are multiple RX queues assigned to different network flows, some IRQs or queues failing first would explain the strange failure behavior where it fails partially before it fails fully.
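To illustrate why a per-queue failure would look like a partial outage, here is a minimal sketch of flows being spread across RX queues. It is not how the atlantic hardware actually hashes packets: the CRC32-based mapping is a simplified stand-in for RSS, the queue count of 8 is an assumption, and the function names are hypothetical.

```python
import zlib


def rx_queue_for_flow(src_ip, dst_ip, src_port, dst_port, num_queues=8):
    """Map a flow 4-tuple to an RX queue index.

    Simplified stand-in for RSS: real NICs use a different hash and an
    indirection table, but the flow-to-queue pinning is the same idea.
    """
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % num_queues


def affected_flows(flows, dead_queues, num_queues=8):
    """Return the flows whose RX queue is in the set of stalled queues."""
    return [
        flow
        for flow in flows
        if rx_queue_for_flow(*flow, num_queues=num_queues) in dead_queues
    ]
```

Under this model, a single stalled queue breaks only the flows that hash to it, so some connections (for example a monitoring session) can keep working while others stop, until the remaining queues fail as well.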

/proc/interrupts dump during the failure:

124:        131          0     316956     321397          0    6301482    5757779    7076594 PCI-MSIX-0000:03:00.0   0 Edge      end0
125:         62          0   22534244          0     180882     109157      19396     163910 PCI-MSIX-0000:03:00.0   1 Edge      end0
126:        178          0          0     150623    9052354    3688564    1227219    4592258 PCI-MSIX-0000:03:00.0   2 Edge      end0
127:   16596356          0          0          0          0          0          0          0 PCI-MSIX-0000:03:00.0   3 Edge      end0
128:         81          0          0     236171      26504    3987434    4310273    5243217 PCI-MSIX-0000:03:00.0   4 Edge      end0
129:         85          0          0         75       8166    7977179   19628093    7158307 PCI-MSIX-0000:03:00.0   5 Edge      end0
130:        141          0          0     740925    8342324    4133795    2094545    1913432 PCI-MSIX-0000:03:00.0   6 Edge      end0
131:         80   27867953          0          0          0          0          0          0 PCI-MSIX-0000:03:00.0   7 Edge      end0
132:          0          0          0          0          0          0          0          0 PCI-MSIX-0000:03:00.0   8 Edge      end0

In this case, IRQ 128 (MSI 4) is ticking up with TX packets, but nothing else is.
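For anyone trying to reproduce this, the "which vectors are still ticking" check can be automated by diffing two snapshots of /proc/interrupts. The following is a rough sketch, assuming the line layout shown in the dump above (IRQ number, per-CPU counters, then descriptive fields ending in the interface name); the function names are hypothetical.

```python
def parse_irq_totals(text, device="end0"):
    """Return {irq_number: count summed over all CPUs} for lines naming `device`."""
    totals = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].endswith(":") or fields[-1] != device:
            continue
        irq = fields[0].rstrip(":")
        if not irq.isdigit():
            continue  # skip named rows such as IPIs
        # The per-CPU counters are the run of integer fields right after "NNN:".
        count = 0
        for field in fields[1:]:
            if not field.isdigit():
                break
            count += int(field)
        totals[int(irq)] = count
    return totals


def active_irqs(before, after):
    """Return {irq: delta} for IRQs whose total increased between two snapshots."""
    return {
        irq: after[irq] - before[irq]
        for irq in after
        if irq in before and after[irq] > before[irq]
    }
```

Reading /proc/interrupts twice a few seconds apart, passing each read through parse_irq_totals and the results through active_irqs, lists only the vectors that are still firing. In the failed state described here, that would be expected to show only IRQ 128.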

ip -s link shows all RX packets being dropped (the dropped and mcast counters are ticking up, and the TX counters advance normally):

2: end0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether a4:77:f3:05:50:1b brd ff:ff:ff:ff:ff:ff
    RX:     bytes    packets errors dropped  missed   mcast
     725823928713  552326168      0   20032       0  449035
    TX:     bytes    packets errors dropped carrier collsns
    1818579085663 1247873976      0       0       0       0
    altname enp3s0
    altname enxa477f305501b

So the PHY is receiving packets, but they aren't making it into the kernel.
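The same counters are exposed under /sys/class/net/<iface>/statistics/, so the stalled state can be detected from a script. This is only a sketch: the stall heuristic (TX advancing, drops advancing, and no more packets received than dropped) is an assumption about how this failure shows up in the counters, and the function names are hypothetical.

```python
from pathlib import Path

# Standard per-interface counters exposed by the kernel in sysfs.
COUNTERS = ("rx_packets", "rx_dropped", "tx_packets", "multicast")


def read_counters(iface="end0", root="/sys/class/net"):
    """Read the selected counters for `iface` as a {name: int} dict."""
    stats = Path(root) / iface / "statistics"
    return {name: int((stats / name).read_text()) for name in COUNTERS}


def counter_deltas(before, after):
    """Return how much each counter advanced between two readings."""
    return {name: after[name] - before[name] for name in before}


def rx_looks_stalled(delta):
    """Heuristic (assumed, not confirmed by the driver): RX is stalled if TX
    and drops are advancing but no packets beyond the dropped ones arrive."""
    return (
        delta["tx_packets"] > 0
        and delta["rx_dropped"] > 0
        and delta["rx_packets"] <= delta["rx_dropped"]
    )
```

Calling read_counters() twice with a pause in between and passing the result of counter_deltas() to rx_looks_stalled() gives a simple yes/no signal that could be logged alongside the /proc/interrupts data.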

Nothing shows up in dmesg when the problem happens. Bringing the interface down and back up (ip link set end0 down && ip link set end0 up) fixes the situation.
