Permalink
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Browse files
net: switchdev: move SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE to the blockin…
…g notifier chain
Currently, br_switchdev_fdb_notify() uses call_switchdev_notifiers (and
br_fdb_replay() open-codes the same thing). This means that drivers
handle the SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE events on the atomic
switchdev notifier block.
Most existing switchdev drivers either talk to firmware, or to a device
over a bus where the I/O is sleepable (SPI, I2C, MDIO etc). So there
exists an (anti)pattern where drivers make a sleepable context for
offloading the given FDB entry by registering an ordered workqueue and
scheduling work items on it, and doing all the work from there.
The problem is the inherent limitation that this design imposes upon
what a switchdev driver can do with those FDB entries.
For example, a switchdev driver might want to perform FDB isolation,
i.e. associate each FDB entry with the bridge it belongs to. Maybe the
driver associates each bridge with a number, allocating that number when
the first port of the driver joins that bridge, and freeing it when the
last port leaves it.
And this is where the problem is. When user space deletes a bridge and
all the ports leave, the bridge will notify us of the deletion of all
FDB entries in atomic context, and switchdev drivers will schedule their
private work items on their private workqueue.
The FDB entry deletion notifications will succeed, the bridge will then
finish deleting itself, but the switchdev work items have not run yet.
When they will eventually get scheduled, the aforementioned association
between the bridge_dev and a number will have already been broken by the
switchdev driver. All ports are standalone now, the bridge is a foreign
interface!
One might say "why don't you cache all your associations while you're
still in the atomic context and they're still valid, pass them by value
through your switchdev_work and work with the cached values as opposed
to the current ones?"
This option smells of poor design, because instead of fixing a central
problem, we add tens of lateral workarounds to avoid it. It should be
easier to use switchdev, not harder, and we should look at the common
patterns which lead to code duplication and eliminate them.
In this case, we must notice that
(a) switchdev already has the concept of notifiers emitted from the fast
path that are still processed by drivers from blocking context. This
is accomplished through the SWITCHDEV_F_DEFER flag which is used by
e.g. SWITCHDEV_OBJ_ID_HOST_MDB.
(b) the bridge del_nbp() function already calls switchdev_deferred_process().
So if we could hook into that, we could have a chance that the
bridge simply waits for our FDB entry offloading procedure to finish
before it calls netdev_upper_dev_unlink() - which is almost
immediately afterwards, and also when switchdev drivers typically
break their stateful associations between the bridge upper and
private data.
So it is in fact possible to use switchdev's generic
switchdev_deferred_enqueue mechanism to get a sleepable callback, and
from there we can call_switchdev_blocking_notifiers().
In the case of br_fdb_replay(), the only code path is from
switchdev_bridge_port_offload(), which is already in blocking context.
So we don't need to go through switchdev_deferred_enqueue, and we can
just call the blocking notifier block directly.
To preserve the same behavior as before, all drivers need to have their
SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handlers moved from their switchdev
atomic notifier blocks to the blocking ones. This patch attempts to make
that trivial movement. Note that now they might schedule a work item for
nothing (since they are now called from a work item themselves), but I
don't have the energy or hardware to test all of them, so this will have
to do.
Note that previously, we were under rcu_read_lock() but now we're not.
I have eyeballed the drivers that make any sort of RCU assumption and
enclosed them between a private rcu_read_lock()/rcu_read_unlock(). This
can be dropped when the drivers themselves are reworked.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>- Loading branch information