Commits on Mar 18, 2021

  1. net: bridge: switchdev: let drivers inform which bridge ports are offloaded
    
    On reception of an skb, the bridge checks if it was marked as 'already
    forwarded in hardware' (checks if skb->offload_fwd_mark == 1), and if it
    is, it puts a mark of its own on that skb, with the switchdev mark of
    the ingress port. Then during forwarding, it enforces that the egress
    port must have a different switchdev mark than the ingress one (this is
    done in nbp_switchdev_allowed_egress).
    
    Non-switchdev drivers don't report any physical switch id (neither
    through devlink nor .ndo_get_port_parent_id), therefore the bridge
    assigns them a switchdev mark of 0, and packets coming from them will
    always have skb->offload_fwd_mark = 0. So there aren't any restrictions.
    
    Problems appear due to the fact that DSA would like to perform software
    fallback for bonding and team interfaces that the physical switch cannot
    offload.
    
             +-- br0 -+
            /   / |    \
           /   /  |     \
          /   /   |      \
         /   /    |       \
        /   /     |        \
       /    |     |       bond0
      /     |     |      /    \
     swp0  swp1  swp2  swp3  swp4
    
    There, it is desirable that the presence of swp3 and swp4 under a
    non-offloaded LAG does not preclude us from doing hardware bridging
    between swp0, swp1 and swp2. The bandwidth of the CPU is often high
    enough that software bridging between {swp0,swp1,swp2} and bond0 is not
    impractical.
    
    But this creates an impossible paradox given the current way in which
    port switchdev marks are assigned. When the driver receives a packet
    from swp0 (say, due to flooding), it must set skb->offload_fwd_mark to
    something.
    
    - If we set it to 0, then the bridge will forward it towards swp1, swp2
      and bond0. But the switch has already forwarded it towards swp1 and
      swp2 (not to bond0, remember, that isn't offloaded, so as far as the
      switch is concerned, ports swp3 and swp4 are not looking up the FDB,
      and the entire bond0 is a destination that is strictly behind the
      CPU). But we don't want duplicated traffic towards swp1 and swp2, so
      it's not ok to set skb->offload_fwd_mark = 0.
    
    - If we set it to 1, then the bridge will not forward the skb towards
      the ports with the same switchdev mark, i.e. not to swp1, swp2 and
      bond0. Towards swp1 and swp2 that's ok, but towards bond0? It should
      have forwarded the skb there.
    
    So the real issue is that bond0 will be assigned the same switchdev mark
    as {swp0,swp1,swp2}, because the function that assigns switchdev marks
    to bridge ports, nbp_switchdev_mark_set, recurses through bond0's lower
    interfaces until it finds something that implements devlink.
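
    The paradox can be reproduced with a small userspace model of the
    egress rule described above. This is only a sketch: the struct and the
    model_* function names are hypothetical stand-ins, not the bridge's
    actual code, and the switchdev mark is reduced to a plain integer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a bridge port; only the mark matters here. */
struct model_port {
	const char *name;
	int offload_fwd_mark;	/* switchdev mark assigned by the bridge */
};

/*
 * Mark the bridge records on a received frame: the ingress port's mark
 * if the driver flagged the frame as already forwarded in hardware,
 * zero otherwise.
 */
int model_frame_mark(bool hw_forwarded, const struct model_port *from)
{
	assert(from);
	return hw_forwarded ? from->offload_fwd_mark : 0;
}

/*
 * Egress rule as described in the text: a frame carrying a nonzero mark
 * may only leave through a port whose mark is different.
 */
bool model_allowed_egress(int frame_mark, const struct model_port *to)
{
	assert(to);
	return frame_mark == 0 || frame_mark != to->offload_fwd_mark;
}
```

    With swp0, swp1 and bond0 all carrying the same mark, a frame flagged
    as hardware-forwarded is refused towards both swp1 and bond0, while an
    unflagged frame is allowed towards both. Neither outcome is the one we
    want.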
    
    A solution is to give the bridge explicit hints as to what switchdev
    mark it should use for each port.
    
    Currently, the bridging offload is very 'silent': a driver registers a
    netdevice notifier, which is put on the netns's notifier chain, and
    which sniffs around for NETDEV_CHANGEUPPER events where the upper is a
    bridge, and the lower is an interface it knows about (one registered by
    this driver, normally). Then, from within that notifier, it does a bunch
    of stuff behind the bridge's back, without the bridge necessarily
    knowing that there's somebody offloading that port. It looks like this:
    
         ip link set swp0 master br0
                      |
                      v
       bridge calls netdev_master_upper_dev_link
                      |
                      v
            call_netdevice_notifiers
                      |
                      v
           dsa_slave_netdevice_event
                      |
                      v
            oh, hey! it's for me!
                      |
                      v
               .port_bridge_join
    
    What we do to solve the conundrum is to be less silent, and emit a
    notification back. Something like this:
    
         ip link set swp0 master br0
                      |
                      v
       bridge calls netdev_master_upper_dev_link
                      |
                      v                    bridge: Aye! I'll use this
            call_netdevice_notifiers           ^  ppid as the
                      |                        |  switchdev mark for
                      v                        |  this port, and zero
           dsa_slave_netdevice_event           |  if I got nothing.
                      |                        |
                      v                        |
            oh, hey! it's for me!              |
                      |                        |
                      v                        |
               .port_bridge_join               |
                      |                        |
                      +------------------------+
                 switchdev_bridge_port_offload(swp0)
    
    Then stacked interfaces (like bond0 on top of swp3/swp4) would be
    treated differently in DSA, depending on whether we can or cannot
    offload them.
    
    The offload case:
    
        ip link set bond0 master br0
                      |
                      v
       bridge calls netdev_master_upper_dev_link
                      |
                      v                    bridge: Aye! I'll use this
            call_netdevice_notifiers           ^  ppid as the
                      |                        |  switchdev mark for
                      v                        |        bond0.
           dsa_slave_netdevice_event           | Coincidentally (or not),
                      |                        | bond0 and swp0, swp1, swp2
                      v                        | all have the same switchdev
            hmm, it's not quite for me,        | mark now, since the ASIC
             but my driver has already         | is able to forward towards
               called .port_lag_join           | all these ports in hw.
              for it, because I have           |
          a port with dp->lag_dev == bond0.    |
                      |                        |
                      v                        |
               .port_bridge_join               |
               for swp3 and swp4               |
                      |                        |
                      +------------------------+
                switchdev_bridge_port_offload(bond0)
    
    And the non-offload case:
    
        ip link set bond0 master br0
                      |
                      v
       bridge calls netdev_master_upper_dev_link
                      |
                      v                    bridge waiting:
            call_netdevice_notifiers           ^  huh, switchdev_bridge_port_offload
                      |                        |  wasn't called, okay, I'll use a
                      v                        |  switchdev mark of zero for this one.
           dsa_slave_netdevice_event           :  Then packets received on swp0 will
                      |                        :  not be forwarded towards swp1, but
                      v                        :  they will towards bond0.
             it's not for me, but
           bond0 is an upper of swp3
          and swp4, but their dp->lag_dev
           is NULL because they couldn't
                offload it.
    
    Basically we can draw the conclusion that the lowers of a bridge port
    can come and go, so depending on the configuration of lowers for a
    bridge port, it can dynamically toggle between offloaded and unoffloaded.
    Therefore, we need an equivalent switchdev_bridge_port_unoffload too.
    
    This patch changes the way any switchdev driver interacts with the
    bridge. From now on, everybody needs to call switchdev_bridge_port_offload,
    otherwise the bridge will treat the port as non-offloaded and allow
    software flooding to other ports from the same ASIC.
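
    The effect of the explicit hint can be sketched in userspace as
    follows. The model_* names are hypothetical, the mark is simplified to
    an integer derived from the switch id, and the real
    switchdev_bridge_port_offload signature is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a bridge port. */
struct model_brport {
	int offload_fwd_mark;	/* 0 means "nobody offloads this port" */
};

/* Driver tells the bridge that it offloads this port. */
void model_bridge_port_offload(struct model_brport *p, int ppid_mark)
{
	assert(p && ppid_mark != 0);	/* 0 is reserved for non-offloaded ports */
	p->offload_fwd_mark = ppid_mark;
}

/* Driver tells the bridge that it no longer offloads this port. */
void model_bridge_port_unoffload(struct model_brport *p)
{
	assert(p);
	p->offload_fwd_mark = 0;
}

/* Same egress rule as before: equal nonzero marks block software forwarding. */
bool model_allowed_egress(int frame_mark, const struct model_brport *to)
{
	return frame_mark == 0 || frame_mark != to->offload_fwd_mark;
}
```

    A bond0 for which the offload call was never made keeps a mark of
    zero, so a frame marked by swp0 is still software-forwarded towards
    it, but not towards swp1. Calling the unoffload counterpart later
    returns the port to that state.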
    
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    vladimiroltean authored and intel-lab-lkp committed Mar 18, 2021
  2. net: dsa: return -EOPNOTSUPP when driver does not implement .port_lag_join
    
    The DSA core has a layered structure, and even though we end up
    returning 0 (success) to user space when setting a bonding/team upper
    that can't be offloaded, some parts of the framework actually need to
    know that we couldn't offload that.
    
    For example, if dsa_switch_lag_join returns 0 as it currently does,
    dsa_port_lag_join has no way to tell a successful offload from a
    software fallback, and it will call dsa_port_bridge_join afterwards.
    Then we'll think we're offloading the bridge master of the LAG, when in
    fact we're not even offloading the LAG. In turn, this will make us set
    skb->offload_fwd_mark = true, which is incorrect and the bridge doesn't
    like it.
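
    A minimal userspace sketch of the layering, assuming hypothetical
    model_* structures rather than the real DSA ones: the lower layer
    reports -EOPNOTSUPP when the driver has no LAG callback, and the upper
    layer uses that to skip the bridge join while still reporting success.
    Where exactly the error is translated to 0 in the real code is not
    shown in this message, so that placement is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical driver ops: NULL callback means "no LAG offload". */
struct model_switch_ops {
	int (*port_lag_join)(int port);
};

/* Hypothetical per-port state recording what is really offloaded. */
struct model_port {
	bool lag_offloaded;
	bool bridge_offloaded;
};

int model_switch_lag_join(const struct model_switch_ops *ops, int port)
{
	assert(ops);
	if (!ops->port_lag_join)
		return -EOPNOTSUPP;	/* lets callers detect software fallback */
	return ops->port_lag_join(port);
}

int model_port_lag_join(const struct model_switch_ops *ops, int port,
			struct model_port *dp, bool lag_has_bridge_upper)
{
	int err = model_switch_lag_join(ops, port);

	assert(dp);
	if (err == -EOPNOTSUPP)
		return 0;	/* software fallback: do not join the bridge */
	if (err)
		return err;

	dp->lag_offloaded = true;
	if (lag_has_bridge_upper)
		dp->bridge_offloaded = true;
	return 0;
}

/* Example driver callback that always succeeds. */
int model_lag_join_ok(int port)
{
	(void)port;
	return 0;
}
```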
    
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    vladimiroltean authored and intel-lab-lkp committed Mar 18, 2021
  3. net: dsa: don't set skb->offload_fwd_mark when not offloading the bridge

    DSA has recently gained the ability to deal gracefully with upper
    interfaces it cannot offload, such as the bridge, bonding or team
    drivers. When such uppers exist, the ports are still in standalone mode
    as far as the hardware is concerned.
    
    But when we deliver packets to the software bridge in order for that to
    do the forwarding, there is an unpleasant surprise in that the bridge
    will refuse to forward them. This is because we unconditionally set
    skb->offload_fwd_mark = true, meaning that the bridge thinks the frames
    were already forwarded in hardware by us.
    
    Since dp->bridge_dev is populated only when there is hardware offload
    for it, but not in the software fallback case, let's introduce a new
    helper that can be called from the tagger data path, which sets
    skb->offload_fwd_mark according to that pointer: zero when there is no
    hardware offload for bridging. This lets the bridge forward packets
    back to other interfaces of our switch, if needed.
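
    The idea behind such a helper can be sketched in userspace like this.
    The model_* types and the helper name are hypothetical; only the rule
    "mark the frame if and only if dp->bridge_dev is set" is taken from
    the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the bridge net device. */
struct model_bridge {
	int id;
};

/* Hypothetical stand-in for the DSA port. */
struct model_dsa_port {
	const struct model_bridge *bridge_dev;	/* set only when bridging is offloaded */
};

/* Hypothetical stand-in for the skb; only the mark is modelled. */
struct model_skb {
	bool offload_fwd_mark;
};

/* Called on receive: claim hardware forwarding only if a bridge is offloaded. */
void model_set_offload_fwd_mark(struct model_skb *skb,
				const struct model_dsa_port *dp)
{
	assert(skb && dp);
	skb->offload_fwd_mark = dp->bridge_dev != NULL;
}
```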
    
    Without this change, sending a packet to the CPU for an unoffloaded
    interface triggers this WARN_ON:
    
    void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
    			      struct sk_buff *skb)
    {
    	if (skb->offload_fwd_mark && !WARN_ON_ONCE(!p->offload_fwd_mark))
    		BR_INPUT_SKB_CB(skb)->offload_fwd_mark = p->offload_fwd_mark;
    }
    
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com>
    vladimiroltean authored and intel-lab-lkp committed Mar 18, 2021
  4. net: ocelot: replay switchdev events when joining bridge

    The premise of this change is that the switchdev port attributes and
    objects offloaded by ocelot might have been missed when we are joining
    an already existing bridge port, such as a bonding interface.
    
    The patch pulls these switchdev attributes and objects from the bridge,
    on behalf of the 'bridge port' net device which might be either the
    ocelot switch interface, or the bonding upper interface.
    
    The ocelot_net.c belongs strictly to the switchdev ocelot driver, while
    ocelot.c is part of a library shared with the DSA felix driver.
    The ocelot_port_bridge_leave function (part of the common library) used
    to call ocelot_port_vlan_filtering(false), something which is not
    necessary for DSA, since the framework already deals with that.
    So we move this call to ocelot_switchdev_unsync, which is specific
    to the switchdev driver.
    
    The code movement described above makes ocelot_port_bridge_leave no
    longer return an error code, so we change its type from int to void.
    
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    vladimiroltean authored and intel-lab-lkp committed Mar 18, 2021
  5. net: ocelot: call ocelot_netdevice_bridge_join when joining a bridged LAG
    
    Similar to the DSA situation, ocelot supports LAG offload but treats
    this scenario improperly:
    
    ip link add br0 type bridge
    ip link add bond0 type bond
    ip link set bond0 master br0
    ip link set swp0 master bond0
    
    We do the same thing as we do there, which is to simulate a 'bridge join'
    on 'lag join', if we detect that the bonding upper has a bridge upper.
    
    Again, same as DSA, ocelot supports software fallback for LAG, and in
    that case, we should avoid calling ocelot_netdevice_changeupper.
    
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    vladimiroltean authored and intel-lab-lkp committed Mar 18, 2021
  6. net: ocelot: support multiple bridges

    The ocelot switches are a bit odd in that they do not have an STP state
    to put the ports into. Instead, the forwarding configuration is delayed
    from the typical port_bridge_join into stp_state_set, when the port enters
    the BR_STATE_FORWARDING state.
    
    I can only guess that the implementation of this quirk is the reason that
    led to the simplification of the driver such that only one bridge could
    be offloaded at a time.
    
    We can simplify the data structures somewhat, and introduce a per-port
    bridge device pointer and STP state, similar to how the LAG offload
    works now (there we have a per-port bonding device pointer and TX
    enabled state). This allows offloading multiple bridges with relative
    ease, while still keeping in place the quirk to delay the programming of
    the PGIDs.
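
    As a rough illustration of the per-port bookkeeping, the sketch below
    computes which ports a given port may forward to when each port
    carries its own bridge pointer and STP state. All model_* names are
    hypothetical, and the real driver's PGID programming (CPU port, LAG
    handling) is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NUM_PORTS 4

enum model_stp {
	MODEL_STP_BLOCKING,
	MODEL_STP_FORWARDING,
};

/* Hypothetical per-port state: bridge pointer plus STP state. */
struct model_port {
	const void *bridge;	/* NULL when the port is standalone */
	enum model_stp stp_state;
};

/*
 * Bitmask of ports that 'src' may forward to: members of the same bridge
 * that are in the forwarding state, excluding 'src' itself. A port that
 * is standalone or not forwarding gets an empty mask.
 */
uint32_t model_fwd_mask(const struct model_port ports[MODEL_NUM_PORTS], int src)
{
	uint32_t mask = 0;
	int i;

	assert(src >= 0 && src < MODEL_NUM_PORTS);
	if (!ports[src].bridge || ports[src].stp_state != MODEL_STP_FORWARDING)
		return 0;

	for (i = 0; i < MODEL_NUM_PORTS; i++) {
		if (i == src)
			continue;
		if (ports[i].bridge == ports[src].bridge &&
		    ports[i].stp_state == MODEL_STP_FORWARDING)
			mask |= UINT32_C(1) << i;
	}
	return mask;
}
```

    Because the mask is derived from per-port state instead of one global
    bridge mask, ports in a second bridge never appear in the mask of
    ports in the first one.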
    
    We actually need this change now because we need to remove the bogus
    restriction from ocelot_bridge_stp_state_set that ocelot->bridge_mask
    needs to contain BIT(port), otherwise that function is a no-op.
    
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    vladimiroltean authored and intel-lab-lkp committed Mar 18, 2021
  7. net: dsa: replay VLANs installed on port when joining the bridge

    Currently this simple setup:
    
    ip link add br0 type bridge vlan_filtering 1
    ip link add bond0 type bond
    ip link set bond0 master br0
    ip link set swp0 master bond0
    
    will not work because the bridge has created the PVID in br_add_if ->
    nbp_vlan_init, and it has notified switchdev of the existence of VLAN 1,
    but that was too early, since swp0 was not yet a lower of bond0, so it
    had no reason to act upon that notification.
    
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    vladimiroltean authored and intel-lab-lkp committed Mar 18, 2021
  8. net: dsa: replay port and local fdb entries when joining the bridge

    When a DSA port joins a LAG that already had an FDB entry pointing to it:
    
    ip link set bond0 master br0
    bridge fdb add dev bond0 00:01:02:03:04:05 master static
    ip link set swp0 master bond0
    
    the DSA port will have no idea that this FDB entry is there, because it
    missed the switchdev event emitted at its creation.
    
    Ido Schimmel pointed this out during a discussion about challenges with
    switchdev offloading of stacked interfaces between the physical port and
    the bridge, and recommended to just catch that condition and deny the
    CHANGEUPPER event:
    https://lore.kernel.org/netdev/20210210105949.GB287766@shredder.lan/
    
    But in fact, we might need to deal with the hard thing anyway, which is
    to replay all FDB addresses relevant to this port, because it isn't just
    static FDB entries, but also local addresses (ones that are not
    forwarded but terminated by the bridge). There, we can't just say 'oh
    yeah, there was an upper already so I'm not joining that'.
    
    So, similar to the logic for replaying MDB entries, add a function that
    must be called by individual switchdev drivers and replays local FDB
    entries as well as ones pointing towards a bridge port. This time, we
    use the atomic switchdev notifier block, since that's what FDB entries
    expect for some reason.
    
    Reported-by: Ido Schimmel <idosch@idosch.org>
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    vladimiroltean authored and intel-lab-lkp committed Mar 18, 2021
  9. net: dsa: replay port and host-joined mdb entries when joining the bridge
    
    I have udhcpcd in my system and this is configured to bring interfaces
    up as soon as they are created.
    
    I create a bridge as follows:
    
    ip link add br0 type bridge
    
    As soon as I create the bridge and udhcpcd brings it up, I have some
    other crap (avahi) that starts sending some random IPv6 packets to
    advertise some local services, and from there, the br0 bridge joins the
    following IPv6 groups:
    
    33:33:ff:6d:c1:9c vid 0
    33:33:00:00:00:6a vid 0
    33:33:00:00:00:fb vid 0
    
    br_dev_xmit
    -> br_multicast_rcv
       -> br_ip6_multicast_add_group
          -> __br_multicast_add_group
             -> br_multicast_host_join
                -> br_mdb_notify
    
    This is all fine, but inside br_mdb_notify we have br_mdb_switchdev_host
    hooked up, and switchdev will attempt to offload the host joined groups
    to an empty list of ports. Of course nobody offloads them.
    
    Then when we add a port to br0:
    
    ip link set swp0 master br0
    
    the bridge doesn't replay the host-joined MDB entries from br_add_if,
    and eventually the host joined addresses expire, and a switchdev
    notification for deleting it is emitted, but surprise, the original
    addition was already completely missed.
    
    The strategy to address this problem is to replay the MDB entries (both
    the port ones and the host joined ones) when the new port joins the
    bridge, similar to what vxlan_fdb_replay does (in that case, its FDB can
    be populated and only then attached to a bridge that you offload).
    However there are 2 possibilities: the addresses can be 'pushed' by the
    bridge into the port, or the port can 'pull' them from the bridge.
    
    Considering that in the general case, the new port can be really late to
    the party, and there may have been many other switchdev ports that
    already received the initial notification, we would like to avoid
    delivering duplicate events to them, since they might misbehave. And
    currently, the bridge calls the entire switchdev notifier chain, whereas
    for replaying it should just call the notifier block of the new guy.
    But the bridge doesn't know what is the new guy's notifier block, it
    just knows where the switchdev notifier chain is. So for simplification,
    we make this a driver-initiated pull for now, and the notifier block is
    passed as an argument.
    
    To emulate the calling context for mdb objects (deferred and put on the
    blocking notifier chain), we must iterate under RCU protection through
    the bridge's mdb entries, queue them, and only call them once we're out
    of the RCU read-side critical section.
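
    The pull model and the two-phase iteration can be sketched in
    userspace as below. Everything here is a hypothetical model: the
    model_notifier struct stands in for a notifier block, the fixed-size
    array stands in for the bridge's mdb list, and the RCU read-side
    section is only indicated by a comment.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_MDB 8

/* Hypothetical mdb entry: group MAC address and VLAN id. */
struct model_mdb_entry {
	unsigned char addr[6];
	unsigned short vid;
};

/* Hypothetical bridge holding a small mdb table. */
struct model_bridge {
	struct model_mdb_entry mdb[MODEL_MAX_MDB];
	size_t num_mdb;
};

/* Stand-in for a notifier block: one callback plus private driver data. */
struct model_notifier {
	int (*call)(struct model_notifier *nb, const struct model_mdb_entry *e);
	void *priv;
};

/* Driver-initiated pull: replay every entry to the caller's notifier only. */
int model_mdb_replay(const struct model_bridge *br, struct model_notifier *nb)
{
	struct model_mdb_entry queue[MODEL_MAX_MDB];
	size_t n = 0, i;

	assert(br && nb && br->num_mdb <= MODEL_MAX_MDB);

	/* Phase 1: snapshot the entries (the kernel would be under RCU here). */
	for (i = 0; i < br->num_mdb; i++)
		queue[n++] = br->mdb[i];

	/* Phase 2: outside the critical section, notify the new port only. */
	for (i = 0; i < n; i++) {
		int err = nb->call(nb, &queue[i]);

		if (err)
			return err;
	}
	return 0;
}

/* Example callback: counts the entries it was shown. */
int model_count_cb(struct model_notifier *nb, const struct model_mdb_entry *e)
{
	(void)e;
	(*(int *)nb->priv)++;
	return 0;
}
```

    Passing the notifier as an argument is what keeps ports that already
    saw the original events from receiving duplicates.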
    
    Suggested-by: Ido Schimmel <idosch@idosch.org>
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    vladimiroltean authored and intel-lab-lkp committed Mar 18, 2021
  10. net: dsa: sync ageing time when joining the bridge

    The SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME attribute is only emitted from:
    
    sysfs/ioctl/netlink
    -> br_set_ageing_time
       -> __set_ageing_time
    
    therefore not at bridge port creation time, so:
    (a) drivers had to hardcode the initial value for the address ageing time,
        because they didn't get any notification
    (b) that hardcoded value can be out of sync, if the user changes the
        ageing time before enslaving the port to the bridge
    
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    vladimiroltean authored and intel-lab-lkp committed Mar 18, 2021
  11. net: dsa: sync multicast router state when joining the bridge

    Make sure that the multicast router setting of the bridge is picked up
    correctly by DSA when joining, regardless of whether there are
    sandwiched interfaces or not. The SWITCHDEV_ATTR_ID_BRIDGE_MROUTER port
    attribute is only emitted from br_mc_router_state_change.
    
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    vladimiroltean authored and intel-lab-lkp committed Mar 18, 2021
  12. net: dsa: sync up VLAN filtering state when joining the bridge

    This is the same situation as for other switchdev port attributes: if we
    join an already-created bridge port, such as a bond master interface,
    then we can miss the initial switchdev notification emitted by the
    bridge for this port.
    
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    vladimiroltean authored and intel-lab-lkp committed Mar 18, 2021
  13. net: dsa: sync up with bridge port's STP state when joining

    It may happen that we have the following topology:
    
    ip link add br0 type bridge stp_state 1
    ip link add bond0 type bond
    ip link set bond0 master br0
    ip link set swp0 master bond0
    ip link set swp1 master bond0
    
    STP decides that it should put bond0 into the BLOCKING state, and
    that's that. The ports that are actively listening for the switchdev
    port attributes emitted for the bond0 bridge port (because they are
    offloading it) and have the honor of seeing that switchdev port
    attribute can react to it, so we can program swp0 and swp1 into the
    BLOCKING state.
    
    But if then we do:
    
    ip link set swp2 master bond0
    
    then as far as the bridge is concerned, nothing has changed: it still
    has one bridge port. But this new bridge port will not see any STP state
    change notification and will remain FORWARDING, which is the state the
    standalone code leaves it in.
    
    Add a function to the bridge which retrieves the current STP state, such
    that drivers can synchronize to it when they may have missed switchdev
    events.
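
    A minimal sketch of that synchronization, using hypothetical model_*
    types and a hypothetical getter name instead of the function actually
    added to the bridge:

```c
#include <assert.h>

/* Hypothetical subset of STP states. */
enum model_stp_state {
	MODEL_STATE_FORWARDING,
	MODEL_STATE_BLOCKING,
};

/* Hypothetical bridge port, e.g. bond0. */
struct model_brport {
	enum model_stp_state state;
};

/* Hypothetical hardware port, e.g. swp2. */
struct model_hw_port {
	enum model_stp_state state;
};

/* Getter the bridge would expose so drivers can catch up on missed events. */
enum model_stp_state model_br_port_get_stp_state(const struct model_brport *p)
{
	assert(p);
	return p->state;
}

/* On join, copy the bridge port's current state instead of assuming one. */
void model_port_bridge_join(struct model_hw_port *hw,
			    const struct model_brport *brport)
{
	assert(hw);
	hw->state = model_br_port_get_stp_state(brport);
}
```

    In the scenario above, swp2 starts out FORWARDING as a standalone port
    and ends up BLOCKING once it joins under bond0.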
    
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    vladimiroltean authored and intel-lab-lkp committed Mar 18, 2021
  14. net: dsa: inherit the actual bridge port flags at join time

    DSA currently assumes that the bridge port starts off with this
    constellation of bridge port flags:
    
    - learning on
    - unicast flooding on
    - multicast flooding on
    - broadcast flooding on
    
    just by virtue of code copy-pasta from the bridge layer (new_nbp).
    This was a simple enough strategy thus far, because the 'bridge join'
    moment always coincided with the 'bridge port creation' moment.
    
    But with sandwiched interfaces, such as:
    
     br0
      |
    bond0
      |
     swp0
    
    it may happen that the user has had time to change the bridge port flags
    of bond0 before enslaving swp0 to it. In that case, swp0 will falsely
    assume that the bridge port flags are those determined by new_nbp, when
    in fact this can happen:
    
    ip link add br0 type bridge
    ip link add bond0 type bond
    ip link set bond0 master br0
    ip link set bond0 type bridge_slave learning off
    ip link set swp0 master bond0
    
    Now swp0 has learning enabled, bond0 has learning disabled. Not nice.
    
    Fix this by "dumpster diving" through the actual bridge port flags with
    br_port_flag_is_set, at bridge join time.
    
    We use this opportunity to split dsa_port_change_brport_flags into two
    distinct functions called dsa_port_inherit_brport_flags and
    dsa_port_clear_brport_flags, now that the implementation for the two
    cases is no longer similar.
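
    The split can be illustrated with the userspace sketch below. The flag
    bits, structs and model_* function names are hypothetical, and the
    standalone defaults used by the clear function (learning off, flooding
    on) are an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_BR_LEARNING	(1UL << 0)
#define MODEL_BR_FLOOD		(1UL << 1)
#define MODEL_BR_MCAST_FLOOD	(1UL << 2)
#define MODEL_BR_BCAST_FLOOD	(1UL << 3)

/* Hypothetical bridge port (e.g. bond0) and hardware port (e.g. swp0). */
struct model_brport {
	unsigned long flags;
};

struct model_hw_port {
	unsigned long flags;
};

static const unsigned long model_offloadable[] = {
	MODEL_BR_LEARNING, MODEL_BR_FLOOD,
	MODEL_BR_MCAST_FLOOD, MODEL_BR_BCAST_FLOOD,
};

bool model_br_port_flag_is_set(const struct model_brport *p, unsigned long flag)
{
	return (p->flags & flag) != 0;
}

/* Join: query each flag on the actual bridge port and mirror it in hardware. */
void model_port_inherit_brport_flags(struct model_hw_port *hw,
				     const struct model_brport *p)
{
	size_t i;

	assert(hw && p);
	for (i = 0; i < sizeof(model_offloadable) / sizeof(model_offloadable[0]); i++) {
		unsigned long flag = model_offloadable[i];

		if (model_br_port_flag_is_set(p, flag))
			hw->flags |= flag;
		else
			hw->flags &= ~flag;
	}
}

/* Leave: return to the assumed standalone defaults. */
void model_port_clear_brport_flags(struct model_hw_port *hw)
{
	assert(hw);
	hw->flags = MODEL_BR_FLOOD | MODEL_BR_MCAST_FLOOD | MODEL_BR_BCAST_FLOOD;
}
```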
    
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    vladimiroltean authored and intel-lab-lkp committed Mar 18, 2021
  15. net: dsa: pass extack to dsa_port_{bridge,lag}_join

    This is a pretty noisy change that was broken out of the larger change
    for replaying switchdev attributes and objects at bridge join time,
    which is when these extack objects are actually used.
    
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    vladimiroltean authored and intel-lab-lkp committed Mar 18, 2021
  16. net: dsa: call dsa_port_bridge_join when joining a LAG that is already in a bridge
    
    DSA can properly detect and offload this sequence of operations:
    
    ip link add br0 type bridge
    ip link add bond0 type bond
    ip link set swp0 master bond0
    ip link set bond0 master br0
    
    But not this one:
    
    ip link add br0 type bridge
    ip link add bond0 type bond
    ip link set bond0 master br0
    ip link set swp0 master bond0
    
    Actually the second one is more complicated: in the time elapsed
    between the enslavement of bond0 and the offloading of it via swp0, a
    lot of things could have happened to the bond0 bridge port in terms of
    switchdev objects (host MDBs, VLANs, altered STP state etc). So this is
    a bit of a can of worms, and making sure that the DSA port's state is in
    sync with this already existing bridge port is handled in the next
    patches.
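
    The second sequence boils down to checking, at LAG join time, whether
    the LAG already has a bridge above it. A userspace sketch with
    hypothetical model_* types (the real code walks netdev upper links,
    which are reduced here to a single master pointer):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical net device with at most one master above it. */
struct model_netdev {
	const char *name;
	struct model_netdev *master;	/* upper device, or NULL */
	bool is_bridge;
};

/* Hypothetical DSA port state. */
struct model_dsa_port {
	struct model_netdev *lag_dev;
	struct model_netdev *bridge_dev;
};

/* Join the LAG, then also join the bridge if the LAG is already a bridge port. */
void model_port_lag_join(struct model_dsa_port *dp, struct model_netdev *lag)
{
	struct model_netdev *upper;

	assert(dp && lag);
	dp->lag_dev = lag;

	upper = lag->master;
	if (upper && upper->is_bridge)
		dp->bridge_dev = upper;	/* simulated bridge join */
}
```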
    
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    vladimiroltean authored and intel-lab-lkp committed Mar 18, 2021
  17. Merge branch 'octeon-tc-offloads'

    Naveen Mamindlapalli says:
    
    ====================
    Add tc hardware offloads
    
    This patch series adds support for tc hardware offloads.
    
    Patch #1 adds support for offloading flows that match IP tos and IP
             protocol, which will be used by the tc hw offload support. It
             also adds ethtool n-tuple filter code to offload flows
             matching the above fields.
    Patch #2 adds tc flower hardware offload support on ingress traffic.
    Patch #3 adds TC flower offload stats.
    Patch #4 adds tc TC_MATCHALL egress ratelimiting offload.
    
    * tc flower hardware offload in PF driver
    
    The driver parses the flow match fields and actions received from the tc
    subsystem and adds/deletes MCAM rules for them. Each flow contains a set
    of match and action fields. If the action or fields are not supported,
    the rule cannot be offloaded to hardware. tc uses the same set of MCAM
    rules allocated for ethtool n-tuple filters, so only one of the two can
    offload flows to hardware at a time; they are made mutually exclusive in
    the driver.
    
    The following matches and actions are supported.
    
    Match: Eth dst_mac, EtherType, 802.1Q {vlan_id,vlan_prio}, vlan EtherType,
           IP proto {tcp,udp,sctp,icmp,icmp6}, IPv4 tos, IPv4{dst_ip,src_ip},
           L4 proto {dst_port|src_port number}.
    Actions: drop, accept, vlan pop, redirect to another port on the device.
    
    The Hardware stats are also supported. Currently only packet counter stats
    are updated.
    
    * tc egress rate limiting support
    Added TC-MATCHALL classifier offload with police action applied for all
    egress traffic on the specified interface.
    ====================
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
    davem330 committed Mar 18, 2021
  18. octeontx2-pf: TC_MATCHALL egress ratelimiting offload

    Add TC_MATCHALL egress ratelimiting offload support with POLICE
    action for entire traffic going out of the interface.
    
    Eg: To ratelimit egress traffic to 100Mbps
    
    $ ethtool -K eth0 hw-tc-offload on
    $ tc qdisc add dev eth0 clsact
    $ tc filter add dev eth0 egress matchall skip_sw \
                    action police rate 100Mbit burst 16Kbit
    
    HW supports a max burst size of ~128KB.
    Only one ratelimiting filter can be installed at a time.
    
    Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
    Signed-off-by: Naveen Mamindlapalli <naveenm@marvell.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Sunil Goutham authored and davem330 committed Mar 18, 2021
  19. octeontx2-pf: add tc flower stats handler for hw offloads

    Add support to get the stats for tc flower flows that are
    offloaded to hardware. To support this feature, added a
    new AF mbox handler which returns the MCAM entry stats
    for a flow that has hardware stat counter enabled.
    
    Signed-off-by: Naveen Mamindlapalli <naveenm@marvell.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Naveen Mamindlapalli authored and davem330 committed Mar 18, 2021
  20. octeontx2-pf: Add tc flower hardware offload on ingress traffic

    This patch adds support for tc flower hardware offload on ingress
    traffic. Since the tc-flower filter rules use the same set of MCAM
    rules as the n-tuple filters, the n-tuple filters and tc flower
    rules are mutually exclusive. When one of the features is enabled
    using ethtool, the other feature is disabled in the driver. By default
    the driver enables n-tuple filters during initialization.
    
    The following flow keys are supported.
        -> Ethernet: dst_mac
        -> L2 proto: all protocols
        -> VLAN (802.1q): vlan_id/vlan_prio
        -> IPv4: dst_ip/src_ip/ip_proto{tcp|udp|sctp|icmp}/ip_tos
        -> IPv6: ip_proto{icmpv6}
        -> L4(tcp/udp/sctp): dst_port/src_port
    
    The following flow actions are supported.
        -> drop
        -> accept
        -> redirect
        -> vlan pop
    
    The flow action supports multiple actions when vlan pop is specified
    as the first action. The redirect action supports redirecting to the
    PF/VF of same PCI device. Redirecting to other PCI NIX devices is not
    supported.
    
    Example #1: Add a tc filter rule to drop UDP traffic with dest port 80
        # ethtool -K eth0 hw-tc-offload on
        # tc qdisc add dev eth0 ingress
        # tc filter add dev eth0 protocol ip parent ffff: flower ip_proto \
              udp dst_port 80 action drop
    
    Example #2: Add a tc filter rule to redirect ingress traffic on eth0
    with vlan id 3 to eth6 (ex: eth0 vf0) after stripping the vlan hdr.
        # ethtool -K eth0 hw-tc-offload on
        # tc qdisc add dev eth0 ingress
        # tc filter add dev eth0 parent ffff: protocol 802.1Q flower \
              vlan_id 3 vlan_ethtype ipv4 action vlan pop action mirred \
              ingress redirect dev eth6
    
    Example #3: List the ingress filter rules
        # tc -s filter show dev eth4 ingress
    
    Example #4: Delete tc flower filter rule with handle 0x1
        # tc filter del dev eth0 ingress protocol ip pref 49152 \
          handle 1 flower
    
    Signed-off-by: Naveen Mamindlapalli <naveenm@marvell.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Naveen Mamindlapalli authored and davem330 committed Mar 18, 2021
  21. octeontx2-pf: Add ip tos and ip proto icmp/icmpv6 flow offload support

    Add support for programming the HW MCAM match key with IP tos, IP(v6)
    proto icmp/icmpv6, allowing flow offload rules to be installed using
    those fields. The NPC HW extracts the layer type, which is used as a
    matching criterion for different IP protocols.
    
    The ethtool n-tuple filter logic has been updated to parse the IP tos
    and l4proto for HW offloading. l4proto tcp/udp/sctp/ah/esp/icmp are
    supported. See example usage below.
    
    Ex: Redirect l4proto icmp to vf 0 queue 0
    ethtool -U eth0 flow-type ip4 l4proto 1 action vf 0 queue 0
    
    Ex: Redirect flow with ip tos 8 to vf 0 queue 0
    ethtool -U eth0 flow-type ip4 tos 8 vf 0 queue 0
    
    Signed-off-by: Naveen Mamindlapalli <naveenm@marvell.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Naveen Mamindlapalli authored and davem330 committed Mar 18, 2021

Commits on Mar 17, 2021

  1. net: macb: simplify clk_init with dev_err_probe

    On some platforms, e.g., the ZynqMP, devm_clk_get can return
    -EPROBE_DEFER if the clock controller, which is implemented in firmware,
    has not been probed yet.
    
    As clk_init is only called during probe, use dev_err_probe to simplify
    the error message and hide it for -EPROBE_DEFER.
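    
    The pattern can be modelled outside the kernel. The sketch below is not
    the macb code: model_err_probe() and MODEL_EPROBE_DEFER are hypothetical
    stand-ins for dev_err_probe() and EPROBE_DEFER. It only mirrors the two
    properties relied on here, namely that the error code is passed through
    as the return value and that nothing is printed as an error when the
    probe is merely deferred.

```c
#include <assert.h>
#include <stdio.h>

/* Kernel-internal errno value; defined locally because userspace
 * <errno.h> does not provide it. */
#define MODEL_EPROBE_DEFER 517

/* Nonzero when the failure should be reported as an error. */
static int model_should_log(int err)
{
	return err != -MODEL_EPROBE_DEFER;
}

/* Hypothetical stand-in for dev_err_probe(): report the failure unless the
 * probe is merely deferred, and hand the error code back to the caller. */
static int model_err_probe(int err, const char *msg)
{
	assert(err < 0);
	if (model_should_log(err))
		fprintf(stderr, "error %d: %s\n", err, msg);
	return err;
}
```

    A failing clock lookup in a probe path then collapses to a single
    statement such as: return model_err_probe(err, "failed to get clock");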
    
    Signed-off-by: Michael Tretter <m.tretter@pengutronix.de>
    Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    tretter authored and davem330 committed Mar 17, 2021
  2. Merge branch 'mv88e6393x'

    Marek Behún says:
    
    ====================
    Add support for mv88e6393x family of Marvell
    
    after 2 months I finally had time to send v17 of Amethyst patches.
    
    This series is tested on Marvell CN9130-CRB.
    
    Changes since v16:
    - dropped patches adding 5gbase-r, since they are already merged
    - rebased onto net-next/master
    - the driver API split the set_egress_flood() method into 2 methods
      for ucast/mcast floods, so the series is adapted accordingly
    
    Changes from v15:
    - put 10000baseKR_Full back into phylink_validate method for Amethyst,
      it seems I misunderstood the meaning behind things and removed it
      from v15
    - removed erratum 3.7, since the procedure is done anyway in
      mv88e6390_serdes_pcs_config
    - renumbered errata 3.6 and 3.8 to 4.6 and 4.8, according to newer
      version of the errata document
    - refactored errata code a little and removed duplicate macro
      definitions (for example MV88E6390_SGMII_CONTROL is already called
      MV88E6390_SGMII_BMCR)
    
    Changes from v14:
    - added my Signed-off-by tags to Pavana's patches, since I am sending
      them (as suggested by Andrew)
    - added documentation to second patch adding 5gbase-r mode (as requested
      by Russell)
    - added Reviewed-by tags
    - applied Vladimir's suggestions:
      - reduced indentation level in mv88e6xxx_set_egress_port and
        mv88e6393x_serdes_port_config
      - removed 10000baseKR_Full from mv88e6393x_phylink_validate
      - removed PHY_INTERFACE_MODE_10GKR from mv88e6xxx_port_set_cmode
    
    Changes from v13:
    - added patch that wraps .set_egress_port into mv88e6xxx_set_egress_port,
      so that we do not have to set chip->*gress_dest_port members in every
      implementation of this method
    - for the patch that adds Amethyst support:
      - added more information into commit message
      - added these methods for mv88e6393x_ops:
          .port_sync_link
          .port_setup_message_port
          .port_max_speed_mode (new implementation needed)
          .atu_get_hash
          .atu_set_hash
          .serdes_pcs_config
          .serdes_pcs_an_restart
          .serdes_pcs_link_up
      - this device can set upstream port per port, so implement
          .port_set_upstream_port
        instead of
          .set_cpu_port
      - removed USXGMII cmode (not yet supported, working on it)
      - added debug messages into mv88e6393x_port_set_speed_duplex
      - added Amethyst errata 4.5 (EEE should be disabled on SERDES ports)
      - fixed 5gbase-r serdes configuration and interrupt handling
      - refactored mv88e6393x_serdes_setup_errata
      - refactored mv88e6393x_port_policy_write
    - added patch implementing .port_set_policy for Amethyst
    ====================
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
    davem330 committed Mar 17, 2021
  3. net: dsa: mv88e6xxx: implement .port_set_policy for Amethyst

    On 6393x, the 16-bit Port Policy CTL register of older chips is replaced
    by Port Policy MGMT CTL, which can access more data, but indirectly and
    via 8-bit registers.
    
    The original 16-bit value is split across the first two 8-bit registers
    of Port Policy MGMT CTL.
    
    We can therefore use the previous code to compute the mask and shift,
    and then
    - if 0 <= shift < 8, we access register 0 in Port Policy MGMT CTL
    - if 8 <= shift < 16, we access register 1 in Port Policy MGMT CTL
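    
    The two cases above can be sketched as follows. This is a minimal
    illustration, not the driver's code: struct policy_loc and
    policy_locate() are hypothetical names, mask16 is assumed to be the
    field mask already shifted into its position in the legacy 16-bit
    value, and the field is assumed not to straddle the boundary between
    the two 8-bit registers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: where a legacy 16-bit policy field lives
 * inside the 8-bit registers of Port Policy MGMT CTL. */
struct policy_loc {
	unsigned int reg;   /* 8-bit register index: 0 or 1 */
	unsigned int shift; /* bit offset inside that register */
	uint8_t mask;       /* field mask inside that register */
};

/* Map a (mask, shift) pair computed for the old 16-bit register onto the
 * 8-bit register that holds the same bits. */
static struct policy_loc policy_locate(uint16_t mask16, unsigned int shift16)
{
	struct policy_loc loc;

	assert(shift16 < 16);
	loc.reg = shift16 / 8;   /* 0 for shifts 0..7, 1 for shifts 8..15 */
	loc.shift = shift16 % 8; /* offset relative to the chosen register */
	loc.mask = (uint8_t)(mask16 >> (loc.reg * 8));
	return loc;
}
```

    For example, a 2-bit field at shift 10 of the old register maps to
    register 1, shift 2, mask 0x0c.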
    
    There are in fact other possible policy settings for Amethyst which
    could be added here, but this can be done in the future.
    
    Signed-off-by: Marek Behún <kabel@kernel.org>
    Reviewed-by: Pavana Sharma <pavana.sharma@digi.com>
    Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    elkablo authored and davem330 committed Mar 17, 2021
  4. net: dsa: mv88e6xxx: add support for mv88e6393x family

    The Marvell 88E6393X device is a single-chip integration of an 11-port
    Ethernet switch with eight integrated Gigabit Ethernet (GbE)
    transceivers and three 10-Gigabit interfaces.
    
    This patch adds functionalities specific to mv88e6393x family (88E6393X,
    88E6193X and 88E6191X).
    
    The main differences between previous devices and this one are:
    - port 0 can be a SERDES port
    - all SERDESes are one-lane, e.g. no XAUI nor RXAUI
    - on the other hand the SERDESes can do USXGMII, 10GBASER and 5GBASER
      (on 6191X only one SERDES is capable of more than 1g; USXGMII is not
      yet supported with this change)
    - Port Policy CTL register is changed to Port Policy MGMT CTL register,
      via which several more registers can be accessed indirectly
    - egress monitor port is configured differently
    - ingress monitor/CPU/mirror ports are configured differently and can be
      configured per port (i.e. each port can have a different ingress monitor
      port, for example)
    - port speed AltBit works differently than previously
    - PHY registers can also be accessed via MDIO address 0x18 and 0x19
      (on previous devices they could be accessed only via Global 2 offsets
       0x18 and 0x19, which means two indirections; this feature is not yet
       leveraged with this commit)
    
    Co-developed-by: Ashkan Boldaji <ashkan.boldaji@digi.com>
    Signed-off-by: Ashkan Boldaji <ashkan.boldaji@digi.com>
    Signed-off-by: Pavana Sharma <pavana.sharma@digi.com>
    Co-developed-by: Marek Behún <kabel@kernel.org>
    Signed-off-by: Marek Behún <kabel@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Pavana Sharma authored and davem330 committed Mar 17, 2021
  5. net: dsa: mv88e6xxx: wrap .set_egress_port method

    There are two implementations of the .set_egress_port method, and both
    of them, if successful, set chip->*gress_dest_port variable.
    
    To avoid code repetition, wrap this method into
    mv88e6xxx_set_egress_port.
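    
    The shape of such a wrapper can be sketched in plain C. The types and
    names below (struct chip, struct chip_ops, chip_set_egress_port and the
    two stub callbacks) are simplified, hypothetical stand-ins rather than
    the driver's definitions. The point is only that the bookkeeping of the
    destination port happens in one place, after the per-chip callback has
    succeeded.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum egress_dir { DIR_INGRESS, DIR_EGRESS };

struct chip;

/* Per-chip callback table (hypothetical, reduced to one method). */
struct chip_ops {
	int (*set_egress_port)(struct chip *chip, enum egress_dir dir, int port);
};

struct chip {
	const struct chip_ops *ops;
	int ingress_dest_port;
	int egress_dest_port;
};

/* Stand-ins for the hardware write: one that succeeds, one that fails. */
static int stub_write_ok(struct chip *chip, enum egress_dir dir, int port)
{
	(void)chip; (void)dir; (void)port;
	return 0;
}

static int stub_write_fail(struct chip *chip, enum egress_dir dir, int port)
{
	(void)chip; (void)dir; (void)port;
	return -EIO;
}

/* Wrapper: call the chip-specific method and, only on success, record the
 * destination port so that implementations need not repeat this step. */
static int chip_set_egress_port(struct chip *chip, enum egress_dir dir, int port)
{
	int err;

	assert(chip != NULL && chip->ops != NULL);
	if (!chip->ops->set_egress_port)
		return -EOPNOTSUPP;

	err = chip->ops->set_egress_port(chip, dir, port);
	if (err)
		return err;

	if (dir == DIR_INGRESS)
		chip->ingress_dest_port = port;
	else
		chip->egress_dest_port = port;

	return 0;
}
```

    With this layout a failed hardware write leaves the cached destination
    ports untouched.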
    
    Signed-off-by: Marek Behún <kabel@kernel.org>
    Reviewed-by: Pavana Sharma <pavana.sharma@digi.com>
    Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    elkablo authored and davem330 committed Mar 17, 2021
  6. net: dsa: mv88e6xxx: change serdes lane parameter type from u8 type t…

    …o int
    
    Returning 0 is no longer an error case with the MV88E6393 family,
    which has serdes lane numbers 0, 9 or 10.
    So with this change .serdes_get_lane will return the lane number
    or -errno (-ENODEV or -EOPNOTSUPP).
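    
    The calling convention can be illustrated with a small sketch. The
    functions below are hypothetical and the identity mapping from port to
    lane is an assumption made for the example. What matters is that lane 0
    is a valid result, so callers must test for a negative value rather
    than for zero.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical lookup: ports 0, 9 and 10 have a serdes lane with the same
 * number; every other port has none. */
static int example_serdes_get_lane(int port)
{
	assert(port >= 0);
	switch (port) {
	case 0:
	case 9:
	case 10:
		return port;    /* valid lane, including lane 0 */
	default:
		return -ENODEV; /* no serdes on this port */
	}
}

/* Callers distinguish success from failure by sign, not by zero. */
static int example_port_has_serdes(int port)
{
	return example_serdes_get_lane(port) >= 0;
}
```

    With a u8 return type and 0 as the error marker, port 0 in this example
    would be indistinguishable from a port without a serdes.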
    
    Signed-off-by: Pavana Sharma <pavana.sharma@digi.com>
    Reviewed-by: Andrew Lunn <andrew@lunn.ch>
    Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
    Signed-off-by: Marek Behún <kabel@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Pavana Sharma authored and davem330 committed Mar 17, 2021
  7. ethernet/microchip:remove unneeded variable: "ret"

    remove unneeded variable: "ret".
    
    Signed-off-by: dingsenjie <dingsenjie@yulong.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    dingsenjie authored and davem330 committed Mar 17, 2021
  8. ethernet/broadcom:remove unneeded variable: "ret"

    remove unneeded variable: "ret".
    
    Signed-off-by: dingsenjie <dingsenjie@yulong.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    dingsenjie authored and davem330 committed Mar 17, 2021
  9. net: stmmac: add per-queue TX & RX coalesce ethtool support

    Extend the driver to support per-queue RX and TX coalesce settings in
    order to support the commands below:
    
    To show per-queue coalesce setting:-
     $ ethtool --per-queue <DEVNAME> queue_mask <MASK> --show-coalesce
    
    To set per-queue coalesce setting:-
     $ ethtool --per-queue <DEVNAME> queue_mask <MASK> --coalesce \
         [rx-usecs N] [rx-frames M] [tx-usecs P] [tx-frames Q]
    
    Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
    Acked-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    elvinongbl authored and davem330 committed Mar 17, 2021
  10. Merge branch 'dsa-doc-fixups'

    Vladimir Oltean says:
    
    ====================
    DSA/switchdev documentation fixups
    
    These are some small fixups after the recently merged documentation
    update.
    ====================
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
    davem330 committed Mar 17, 2021
  11. Documentation: networking: dsa: mention that the master is brought up…

    … automatically
    
    Since commit 9d5ef19 ("net: dsa: automatically bring up DSA master
    when opening user port"), DSA manages the administrative status of the
    host port automatically. Update the configuration steps to reflect this.
    
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    vladimiroltean authored and davem330 committed Mar 17, 2021
  12. Documentation: networking: dsa: demote subsections to simple emphasiz…

    …ed words
    
    "make htmldocs" complains:
    configuration.rst:165: WARNING: duplicate label networking/dsa/configuration:single port, other instance in (...)
    configuration.rst:212: WARNING: duplicate label networking/dsa/configuration:bridge, other instance in (...)
    configuration.rst:252: WARNING: duplicate label networking/dsa/configuration:gateway, other instance in (...)
    
    And for good reason, because the "single port", "bridge" and "gateway"
    use cases each appear twice, once for normal taggers and once for
    DSA_TAG_PROTO_NONE. So when trying to reference these sections via a
    hyperlink such as:
    
    https://www.kernel.org/doc/html/latest/networking/dsa/configuration.html#single-port
    
    it will always reference the first occurrence, and never the second one.
    
    This change makes the "single port", "bridge" and "gateway"
    configuration examples consistent with the formatting used in the
    "Configuration showcases" subsection.
    
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    vladimiroltean authored and davem330 committed Mar 17, 2021
  13. Documentation: networking: dsa: add missing new line in devlink section

    "make htmldocs" produces these warnings:
    Documentation/networking/dsa/dsa.rst:468: WARNING: Unexpected indentation.
    Documentation/networking/dsa/dsa.rst:477: WARNING: Block quote ends without a blank line; unexpected unindent.
    
    Fixes: 8411abb ("Documentation: networking: dsa: mention integration with devlink")
    Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    vladimiroltean authored and davem330 committed Mar 17, 2021
  14. Documentation: networking: switchdev: add missing "and" word

    Even though this is clear from the context, it is nice to actually be
    grammatically correct.
    
    Fixes: 0f22ad4 ("Documentation: networking: switchdev: clarify device driver behavior")
    Reported-by: Ido Schimmel <idosch@nvidia.com>
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    vladimiroltean authored and davem330 committed Mar 17, 2021