Xuan-Zhuo/virt…
Commits on Apr 13, 2021
-
virtio-net: xsk zero copy xmit kick by threshold
Testing shows that performance is not stable when kick is called for every packet, and that performance is also poor when the kick is issued only after all the packets have been sent. So add a module parameter that specifies how many packets are sent before a kick is issued. 8 is a relatively stable value with the best performance.

Here is the pps measured for xsk_kick_thr values from 1 to 64:

thr  PPS            | thr  PPS            | thr  PPS
1    2924116.74247  | 23   3683263.04348  | 45   2777907.22963
2    3441010.57191  | 24   3078880.13043  | 46   2781376.21739
3    3636728.72378  | 25   2859219.57656  | 47   2777271.91304
4    3637518.61468  | 26   2851557.9593   | 48   2800320.56575
5    3651738.16251  | 27   2834783.54408  | 49   2813039.87599
6    3652176.69231  | 28   2847012.41472  | 50   3445143.01839
7    3665415.80602  | 29   2860633.91304  | 51   3666918.01281
8    3665045.16555  | 30   2857903.5786   | 52   3059929.2709
9    3671023.2401   | 31   2835589.98963  | 53   2831515.21739
10   3669532.23274  | 32   2862827.88706  | 54   3451804.07204
11   3666160.37749  | 33   2871855.96696  | 55   3654975.92385
12   3674951.44813  | 34   3434456.44816  | 56   3676198.3188
13   3667447.57331  | 35   3656918.54177  | 57   3684740.85619
14   3018846.0503   | 36   3596921.16722  | 58   3060958.8594
15   2792773.84505  | 37   3603460.63903  | 59   2828874.57191
16   3430596.3602   | 38   3595410.87666  | 60   3459926.11027
17   3660525.85806  | 39   3604250.17819  | 61   3685444.47599
18   3045627.69054  | 40   3596542.28428  | 62   3049959.0809
19   2841542.94177  | 41   3600705.16054  | 63   2806280.04013
20   2830475.97348  | 42   3019833.71191  | 64   3448494.3913
21   2845655.55789  | 43   2752951.93264  |
22   3450389.84365  | 44   2753107.27164  |

When the value of xsk_kick_thr is relatively small, the performance is not good, and when its value is greater than 13, the performance becomes irregular and unstable. The results look similar from 3 to 13, so I chose 8 as the default value.

The test environment is qemu + vhost-net. I modified vhost-net to drop the packets sent by the vm directly, so that the cpu of the vm can run higher. By default, the cpu usage of the processes in the vm and of softirqd is too low, and there is no obvious difference in the test data.

During the test, the cpu of softirq reached 100%. Each xsk_kick_thr value was run for 300s, the pps of every second was recorded, and the average of the pps was finally taken. The vhost process cpu on the host also reached 100%.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
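The effect of a kick threshold on the number of notifications can be pictured with a small model. This is only a sketch, not the driver code: count_kicks() is a hypothetical helper, and the final flush of a partial batch is an assumption made here so that no packet is left without a kick.

```python
def count_kicks(num_packets, kick_thr):
    """Count kicks when one kick is issued every kick_thr queued packets."""
    if kick_thr < 1:
        raise ValueError("kick_thr must be at least 1")
    kicks = 0
    pending = 0  # packets queued since the last kick
    for _ in range(num_packets):
        pending += 1
        if pending >= kick_thr:
            kicks += 1
            pending = 0
    if pending:
        kicks += 1  # assumed: flush the remaining partial batch
    return kicks
```

With a threshold of 1 every packet costs a kick, while a larger threshold amortizes one kick over several packets.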
-
virtio-net: xsk zero copy xmit implement wakeup and xmit
This patch implements the core part of xsk zerocopy xmit.

When the user calls sendto to consume the data in the xsk tx queue, virtnet_xsk_wakeup() will be called. In wakeup, it will try to send a part of the data directly. There are two purposes for this implementation:

1. Send part of the data quickly to reduce the transmission delay of the first packet.
2. Trigger the tx interrupt, which starts napi to consume the xsk tx data.

All sent xsk packets share the virtio-net header of xsk_hdr. If xsk needs to support csum and other functions later, consider assigning xsk hdr separately for each sent packet.

There are now three situations in free_old_xmit(): skb, xdp frame, xsk desc. They are distinguished by the last two bits of the ptr returned by virtqueue_get_buf():

00 is skb by default.
01 represents the packet sent by xdp.
10 is the packet sent by xsk.

If the xmit work of xsk has not been completed but the ring is full, napi must first exit and wait for the ring to become available, so need_wakeup() is set. If free_old_xmit() is called first by start_xmit(), we can quickly wake up napi to execute the xsk xmit task.

When recycling, we need to count the number of bytes sent, so xsk desc->len is stored in ptr, because ptr does not point to a meaningful object in the xsk case.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
-
virtio-net: xsk zero copy xmit setup
xsk is a high-performance packet receiving and sending technology. This patch implements the binding and unbinding operations of xsk and the virtio-net queue for xsk zero copy xmit. The xsk zero copy xmit depends on tx napi, so if tx napi is not enabled, an error will be reported. The entire operation is under the protection of rtnl_lock. If xsk is active, it will prevent ethtool from modifying tx napi. Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
-
virtio-net: virtnet_poll_tx support budget check
virtnet_poll_tx() now checks the work done like other network card drivers. When work < budget, napi_poll() in dev.c will exit directly, and virtqueue_napi_complete() will be called to close napi. If closing napi fails or there is still data to be processed, virtqueue_napi_complete() will schedule napi again, which does not conflict with the logic of napi_poll(). When work == budget, virtnet_poll_tx() will return the variable 'work', and napi_poll() in dev.c will re-add napi to the queue. The purpose of this patch is to support xsk xmit in virtnet_poll_tx() in a subsequent patch. Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
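The budget contract described above can be modelled in a few lines. This is a simplified sketch, not the kernel implementation: poll_tx() and its return convention are hypothetical, and a deque stands in for the pending tx work.

```python
from collections import deque

def poll_tx(pending, budget):
    """Process at most `budget` items; return (work, completed).

    completed is True when work < budget, i.e. polling can be closed.
    When work == budget the caller is expected to poll again.
    """
    work = 0
    while pending and work < budget:
        pending.popleft()  # stand-in for transmitting or recycling one item
        work += 1
    return work, work < budget
```

A caller that keeps polling while `completed` is False mirrors the re-queue behaviour that the message attributes to napi_poll().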
-
virtio-net: unify the code for recycling the xmit ptr
There are currently two types, "skb" and "xdp frame", when recycling old xmit, handled by two nearly identical but independent implementations. This is inconvenient for the subsequent addition of new types. So extract a function from this code and call it uniformly to recover the old xmit ptr. Rename free_old_xmit_skbs() to free_old_xmit(). Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
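A unified recycling path can be pictured as one loop that classifies each returned ptr and updates shared counters. The sketch below assumes the two-low-bit tagging listed in the "xsk zero copy xmit implement wakeup and xmit" message in this log (00 skb, 01 xdp frame, 10 xsk desc). Python integers stand in for pointers, all names are hypothetical, and packing the xsk length as `len << 2` is an assumption made for illustration.

```python
TAG_MASK = 0b11
TAG_SKB = 0b00
TAG_XDP = 0b01
TAG_XSK = 0b10

def xsk_to_ptr(length):
    """Pack an xsk descriptor length into a tagged pseudo-pointer (assumed layout)."""
    return (length << 2) | TAG_XSK

def classify(ptr):
    """Return (kind, value): an untagged address for skb/xdp, a length for xsk."""
    tag = ptr & TAG_MASK
    if tag == TAG_XSK:
        return "xsk", ptr >> 2
    if tag == TAG_XDP:
        return "xdp", ptr & ~TAG_MASK
    if tag == TAG_SKB:
        return "skb", ptr
    raise ValueError("unknown tag")

def free_old_xmit(ptrs, length_of):
    """Count packets and bytes for all completed buffers in one pass."""
    packets = 0
    nbytes = 0
    for ptr in ptrs:
        kind, value = classify(ptr)
        # xsk carries its length in the ptr; other kinds are looked up by address
        nbytes += value if kind == "xsk" else length_of[value]
        packets += 1
    return packets, nbytes
```

Adding a new buffer type then only needs a new tag and one more branch in classify().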
-
xsk: XDP_SETUP_XSK_POOL support option IFF_NOT_USE_DMA_ADDR
Some devices, such as virtio-net, do not directly use dma addr. These devices do not initialize dma after completing the xsk setup, so the dma check is skipped here. Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
-
Add an xsk interface that returns the page corresponding to the data. virtio-net does not initialize dma, so it needs the page to construct the scatterlist that is passed to the vring. Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
-
virtio-net: add priv_flags IFF_NOT_USE_DMA_ADDR
virtio-net does not use dma addr directly. So add the priv_flag IFF_NOT_USE_DMA_ADDR. Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
-
netdevice: add priv_flags IFF_NOT_USE_DMA_ADDR
Some devices, such as virtio-net, do not directly use dma addr. Upper-level frameworks such as xdp socket need to be aware of this. So add a new priv_flag IFF_NOT_USE_DMA_ADDR. Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
-
netdevice: priv_flags extend to 64bit
The size of priv_flags is 32 bits, and the number of flags currently in use has reached 32. It is time to expand priv_flags to 64 bits. Here priv_flags is changed to 8 bytes, but the size of struct net_device has not changed; it is still 2176 bytes. This is because _tx is aligned to the cache line. But there is a 4-byte hole left here. Since the fields before and after priv_flags are read mostly, I did not adjust the order of the fields here. Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Commits on Apr 12, 2021
-
net: ethernet: ravb: Enable optional refclk
For devices that use a programmable clock for the AVB reference clock, the driver may need to enable them. Add code to find the optional clock and enable it when available. Signed-off-by: Adam Ford <aford173@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
-
dt-bindings: net: renesas,etheravb: Add additional clocks
The AVB driver assumes there is an external crystal, but it could be clocked by other means. In order to enable a programmable clock, it needs to be added to the clocks list and enabled in the driver. Since there is currently only one clock, there is no clock-names list either. Update bindings to add the additional optional clock, and explicitly name both of them. Signed-off-by: Adam Ford <aford173@gmail.com> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Acked-by: Rob Herring <robh@kernel.org> Reviewed-by: Sergei Shtylyov <sergei.shtylyov@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Yangbo Lu says:

====================
enetc: support PTP Sync packet one-step timestamping

This patch-set is to add support for PTP Sync packet one-step timestamping. Since the ENETC single-step register has to be configured dynamically per packet for correctionField offset and UDP checksum update, the current one-step timestamping packet can be sent only when the last one completes transmitting on hardware. So, on TX, this patch handles one-step timestamping packets as below:

- Transmit the packet immediately if no other one is in transfer, or queue it to the skb queue if there is already one in transfer. The test_and_set_bit_lock() is used here to lock and check state.
- Start a work when the transfer completes on hardware, to release the bit lock and to send one skb from the skb queue if there is one.

Changes for v2:
- Rebased.
- Fixed issues from patchwork checks.
- netif_tx_lock for one-step timestamping packet sending.

Changes for v3:
- Used system workqueue.
- Set the bit lock when a one-step packet is transmitted, and scheduled work when it completes. The worker clears the bit lock and transmits one skb from the skb queue if there is one, instead of looping.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
-
enetc: support PTP Sync packet one-step timestamping
This patch is to add support for PTP Sync packet one-step timestamping. Since the ENETC single-step register has to be configured dynamically per packet for correctionField offset and UDP checksum update, the current one-step timestamping packet can be sent only when the last one completes transmitting on hardware. So, on TX, this patch handles one-step timestamping packets as below:

- Transmit the packet immediately if no other one is in transfer, or queue it to the skb queue if there is already one in transfer. The test_and_set_bit_lock() is used here to lock and check state.
- Start a work when the transfer completes on hardware, to release the bit lock and to send one skb from the skb queue if there is one.

The configuration for one-step timestamping on ENETC before transmitting is:

- Set the one-step timestamping flag in the extension BD.
- Write the 30-bit current timestamp in the tstamp field of the extension BD.
- Update the PTP Sync packet originTimestamp field with the current timestamp.
- Configure the single-step register for correctionField offset and UDP checksum update.

Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
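The "one packet in flight, the rest queued" handling described in this message can be sketched as a small state machine. This is an illustrative model, not the driver code: the OneStepTx class and its method names are hypothetical, and a plain boolean stands in for the bit lock taken with test_and_set_bit_lock().

```python
from collections import deque

class OneStepTx:
    def __init__(self):
        self.in_flight = False   # stand-in for the bit lock
        self.queue = deque()     # one-step packets waiting for their turn
        self.sent = []           # packets handed to the (simulated) hardware

    def xmit(self, skb):
        """Send immediately if nothing is in transfer, otherwise queue."""
        if self.in_flight:
            self.queue.append(skb)
            return False
        self.in_flight = True
        self.sent.append(skb)
        return True

    def tx_complete_work(self):
        """Run when hardware finishes: release the lock, send one queued skb."""
        self.in_flight = False
        if self.queue:
            self.xmit(self.queue.popleft())
```

Only one queued packet is sent per completion, which matches the "send one skb" wording rather than draining the queue in a loop.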
-
enetc: mark TX timestamp type per skb
Mark the TX timestamp type per skb in skb->cb[0], instead of using a global variable for all skbs. This is a preparation for one-step timestamp support. With one-step timestamping enabled, there will be both one-step and two-step PTP messages to transfer. A skb queue is needed for one-step PTP messages to make sure that sending of the current message starts only after the last one has completed on hardware. (The ENETC single-step register has to be dynamically configured per message.) So, marking the TX timestamp type per skb is required. Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Lijun Pan says: ==================== ibmvnic: improve error printing Patch 1 prints reset reason as a string. Patch 2 prints adapter state as a string. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
ibmvnic: print adapter state as a string
The adapter state can be added or deleted over different versions of the source code. Print a string instead of a number. Signed-off-by: Lijun Pan <lijunp213@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
ibmvnic: print reset reason as a string
The reset reason can be added or deleted over different versions of the source code. Print a string instead of a number. Signed-off-by: Lijun Pan <lijunp213@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
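Printing a name instead of a number can be sketched as a lookup with a fallback for unknown values. The enum members and numeric values below are hypothetical placeholders, not the driver's actual reset reasons.

```python
from enum import IntEnum

class ResetReason(IntEnum):
    # Hypothetical values for illustration only
    FAILOVER = 1
    MOBILITY = 2
    FATAL = 3

def reset_reason_to_string(value):
    """Map a numeric reason to its name; unknown numbers stay readable."""
    try:
        return ResetReason(value).name
    except ValueError:
        return "UNKNOWN"
```

A log line built from the name keeps its meaning even if the numeric values shift between source versions.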
-
ibmvnic: clean up the remaining debugfs data structures
Commit e704f04 ("ibmvnic: Remove debugfs support") did not clean up everything. Remove the remaining code. Signed-off-by: Lijun Pan <lijunp213@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Merge branch 'netns-sysctl-isolation'
Jonathon Reinhart says: ==================== Ensuring net sysctl isolation This patchset is the result of an audit of /proc/sys/net to prove that it is safe to be mounted read-write in a container when a net namespace is in use. See [1]. The first commit adds code to detect sysctls which are not netns-safe, and can "leak" changes to other net namespaces. My manual audit found, and the above feature confirmed, that there are two nf_conntrack sysctls which are in fact not netns-safe. I considered sending the latter to netfilter-devel, but I think it's better to have both together on net-next: Adding only the former causes undesirable warnings in the kernel log. [1]: opencontainers/runc#2826 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
netfilter: conntrack: Make global sysctls readonly in non-init netns
These sysctls point to global variables:

- NF_SYSCTL_CT_MAX (&nf_conntrack_max)
- NF_SYSCTL_CT_EXPECT_MAX (&nf_ct_expect_max)
- NF_SYSCTL_CT_BUCKETS (&nf_conntrack_htable_size_user)

Because their data pointers are not updated to point to per-netns structures, they must be marked read-only in a non-init_net ns. Otherwise, changes in any net namespace are reflected in (leaked into) all other net namespaces. This problem has existed since the introduction of net namespaces.

The current logic marks them read-only only if the net namespace is owned by an unprivileged user (other than init_user_ns).

Commit d0febd8 ("netfilter: conntrack: re-visit sysctls in unprivileged namespaces") "exposes all sysctls even if the namespace is unpriviliged." Since we need to mark them readonly in any case, we can forego the unprivileged user check altogether.

Fixes: d0febd8 ("netfilter: conntrack: re-visit sysctls in unprivileged namespaces") Signed-off-by: Jonathon Reinhart <Jonathon.Reinhart@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
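The rule stated in this message, that an entry whose data pointer refers to a global must not be writable outside init_net, can be sketched as follows. This is a minimal Python model, not the kernel code: SysctlEntry, its fields, restrict_for_netns() and the entry name "per_netns_value" are hypothetical and used only for illustration.

```python
from dataclasses import dataclass

WRITE_BITS = 0o222  # owner/group/other write permission bits

@dataclass
class SysctlEntry:
    name: str
    mode: int             # file mode, e.g. 0o644
    data_is_global: bool  # True if the data pointer refers to a global variable

def restrict_for_netns(entries, is_init_net):
    """Return copies of entries, read-only where a child netns could leak writes."""
    result = []
    for e in entries:
        mode = e.mode
        if not is_init_net and e.data_is_global:
            mode &= ~WRITE_BITS  # drop all write bits for global-backed entries
        result.append(SysctlEntry(e.name, mode, e.data_is_global))
    return result
```

Entries backed by per-netns data keep their mode, so only the global-backed ones lose write access in a child namespace.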
-
net: Ensure net namespace isolation of sysctls
This adds an ensure_safe_net_sysctl() check during register_net_sysctl() to validate that sysctl table entries for a non-init_net netns are sufficiently isolated. To be netns-safe, an entry must adhere to at least (and usually exactly) one of these rules:

1. It is marked read-only inside the netns.
2. Its data pointer does not point to kernel/module global data.

An entry which fails both of these checks is indicative of a bug, whereby a child netns can affect global net sysctl values. If such an entry is found, this code will issue a warning to the kernel log, and force the entry to be read-only to prevent a leak.

To test, simply create a new netns:

$ sudo ip netns add dummy

As it sits now, this patch will WARN for two sysctls which will be addressed in a subsequent patch:

- /proc/sys/net/netfilter/nf_conntrack_max
- /proc/sys/net/netfilter/nf_conntrack_expect_max

Signed-off-by: Jonathon Reinhart <Jonathon.Reinhart@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
nfc: pn533: remove redundant assignment
In many places, a value is first assigned to a variable and then the variable is returned, which is redundant; return the value directly instead. In the pn533_rf_field function, rc is already returned inside the if statement, so the last return rc is replaced with return 0. Signed-off-by: wengjianfeng <wengjianfeng@yulong.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Merge branch 'bnxt_en-error-recovery'
Michael Chan says: ==================== bnxt_en: Error recovery fixes. This series adds some fixes and enhancements to the error recovery logic. The health register logic is improved and we also add missing code to free and re-create VF representors in the firmware after error recovery. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
bnxt_en: Free and allocate VF-Reps during error recovery.
During firmware recovery, VF-Rep configuration in the firmware is lost. Fix it by freeing and (re)allocating VF-Reps in FW at relevant points during the error recovery process. Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
bnxt_en: Refactor __bnxt_vf_reps_destroy().
Add a new helper function __bnxt_free_one_vf_rep() to free one VF rep. We also reinitialize the VF rep fields to proper initial values so that the function can be used without freeing the VF rep data structure. This will be used in subsequent patches to free and recreate VF reps after error recovery. Reviewed-by: Edwin Peer <edwin.peer@broadcom.com> Reviewed-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
bnxt_en: Refactor bnxt_vf_reps_create().
Add a new function bnxt_alloc_vf_rep() to allocate a VF representor. This function will be needed in subsequent patches to recreate the VF reps after error recovery. Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
bnxt_en: Invalidate health register mapping at the end of probe.
After probe is successful, the interface may not be brought up in all cases, and the health register mapping could become invalid if the firmware undergoes a reset. Fix it by invalidating the health register at the end of probe. It will be remapped during ifup. Fixes: 43a440c ("bnxt_en: Improve the status_reliable flag in bp->fw_health.") Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
bnxt_en: Treat health register value 0 as valid in bnxt_try_reover_fw().
The retry loop in bnxt_try_recover_fw() should not abort when the health register value is 0. It is a valid value that indicates the firmware is booting up. Fixes: 861aae7 ("bnxt_en: Enhance retry of the first message to the firmware.") Reviewed-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
net: seg6: trivial fix of a spelling mistake in comment
There is a comment spelling mistake "interfarence" -> "interference" in function parse_nla_action(). Fix it. Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Signed-off-by: David S. Miller <davem@davemloft.net>
-
net: hns3: Fix potential null pointer defererence of null ae_dev
The reset_prepare and reset_done calls have a null pointer check on ae_dev, however ae_dev is being dereferenced via the call to ns3_is_phys_func with the ae->pdev argument. Fix this by performing a null pointer check on ae_dev and hence short-circuiting the dereference to ae_dev on the call to ns3_is_phys_func. Addresses-Coverity: ("Dereference before null check") Fixes: 715c58e ("net: hns3: add suspend and resume pm_ops") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
net: thunderx: Fix unintentional sign extension issue
When the u8 integer rq->caching is shifted left by 26 bits, it is promoted to a 32 bit signed int and the result is then sign-extended to a u64. In the event that rq->caching is greater than 0x1f, all the upper 32 bits of the u64 end up being set as well because of the int sign-extension. Fix this by casting the u8 values to a u64 before the 26 bit left shift. Addresses-Coverity: ("Unintended sign extension") Fixes: 4863dea ("net: Adding support for Cavium ThunderX network controller") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
cxgb4: Fix unintentional sign extension issues
When the u8 integers f->fs.nat_lip[] are shifted left by 24 bits, they are promoted to a 32 bit signed int and the result is then sign-extended to a u64. In the event that the top bit of the u8 is set, all the upper 32 bits of the u64 end up being set as well because of the sign-extension. Fix this by casting the u8 values to a u64 before the 24 bit left shift. Addresses-Coverity: ("Unintended sign extension") Fixes: 12b276f ("cxgb4: add support to create hash filters") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
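The sign-extension effect described in the thunderx and cxgb4 messages can be reproduced with explicit bit masks. The sketch below models typical two's-complement C behaviour in Python; it is not C itself, the helper names are hypothetical, and strictly speaking shifting into the sign bit of a signed int is undefined in C, so the "buggy" result is what common compilers produce rather than a guarantee.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_s32(x):
    """Interpret the low 32 bits of x as a signed 32-bit integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def buggy_shift(byte, shift):
    """u8 promoted to int, shifted, then converted to u64 (sign-extends)."""
    as_int = to_s32((byte & 0xFF) << shift)
    return as_int & MASK64

def fixed_shift(byte, shift):
    """u8 cast to u64 before the shift, so no sign bit is involved."""
    return ((byte & 0xFF) << shift) & MASK64
```

The two functions agree whenever bit 31 of the shifted value is clear, which is why the bug only shows up for values above 0x1f with a 26-bit shift or with the top bit set for a 24-bit shift.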
Commits on Apr 11, 2021
-
Alex Elder says: ==================== net: ipa: support two more platforms This series adds IPA support for two more Qualcomm SoCs. The first patch updates the DT binding to add compatible strings. The second temporarily disables checksum offload support for IPA version 4.5 and above. Changes are required to the RMNet driver to support the "inline" checksum offload used for IPA v4.5+, and once those are present this capability will be enabled for IPA. The third and fourth patches add configuration data for IPA versions 4.5 (used for the SDX55 SoC) and 4.11 (used for the SC7280 SoC). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
net: ipa: add IPA v4.11 configuration data
Add support for the SC7280 SoC, which includes IPA version 4.11. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>