forked from msm8953-mainline/linux
-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sync #9
Closed
Closed
Sync #9
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Cited patch below blocked the TLS TX device offload unless HW_CSUM is set. This broke devices that use IP_CSUM && IP6_CSUM. Here we fix it. Note that the single HW_TLS_TX feature flag indicates support for both IPv4/6, hence it should still be disabled in case only one of (IP_CSUM | IPV6_CSUM) is set. Fixes: ae0b04b ("net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reported-by: Rohit Maheshwari <rohitm@chelsio.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Link: https://lore.kernel.org/r/20210114151215.7061-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add function to set power brake sequence. Signed-off-by: Likun Gao <Likun.Gao@amd.com> Reviewed-by: Kenneth Feng <kenneth.feng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This reverts commit f01afd1. Cc: Wayne Lin <Wayne.Lin@amd.com> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Harry Wentland <Harry.Wentland@amd.com> Cc: Roman Li <Roman.Li@amd.com> Cc: Bindu R <Bindu.R@amd.com> Cc: Daniel Vetter <daniel@ffwll.ch> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
…ger" This reverts commit 6ae09fa. Cc: Wayne Lin <Wayne.Lin@amd.com> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Harry Wentland <Harry.Wentland@amd.com> Cc: Roman Li <Roman.Li@amd.com> Cc: Bindu R <Bindu.R@amd.com> Cc: Daniel Vetter <daniel@ffwll.ch> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This reverts commit c920888. Cc: Wayne Lin <Wayne.Lin@amd.com> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Harry Wentland <Harry.Wentland@amd.com> Cc: Roman Li <Roman.Li@amd.com> Cc: Bindu R <Bindu.R@amd.com> Cc: Daniel Vetter <daniel@ffwll.ch> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Wayne Lin <Wayne.Lin@amd.com> Signed-off-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why] Find out when we try to disable CRC calculation, crc generation is still enabled. Main reason is that dc_stream_configure_crc() will never get called when the source is AMDGPU_DM_PIPE_CRC_SOURCE_NONE. [How] Add checking condition that when source is AMDGPU_DM_PIPE_CRC_SOURCE_NONE, we should also call dc_stream_configure_crc() to disable crc calculation. Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com> Signed-off-by: Wayne Lin <Wayne.Lin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
…/git/hid/hid Pull HID fixes from Jiri Kosina: - memory leak fix for Wacom driver (Ping Cheng) - various trivial small fixes, cleanups and device ID additions * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid: HID: logitech-hidpp: Add product ID for MX Ergo in Bluetooth mode HID: Ignore battery for Elan touchscreen on ASUS UX550 HID: logitech-dj: add the G602 receiver HID: wiimote: remove h from printk format specifier HID: uclogic: remove h from printk format specifier HID: sony: select CONFIG_CRC32 HID: sfh: fix address space confusion HID: multitouch: Enable multi-input for Synaptics pointstick/touchpad device HID: wacom: Fix memory leakage caused by kfifo_alloc
tcp_disconnect() expects the caller acquires the sock lock, but mptcp_disconnect() is not doing that. Add the missing required lock. Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Fixes: 76e2a55 ("mptcp: better msk-level shutdown.") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/f818e82b58a556feeb71dcccc8bf1c87aafc6175.1610638176.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When setting port traddr to INADDR_ANY, the listening cm_id->device is NULL. The associate IB device is known only when a connect request event arrives, so checking T10-PI device capability should be done at this stage. Fixes: b09160c ("nvmet-rdma: add metadata/T10-PI support") Signed-off-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
We shouldn't call smp_processor_id() in a preemptible context, but this is advisory at best, so instead call __smp_processor_id(). Fixes: db5ad6b ("nvme-tcp: try to send request in queue_rq context") Reported-by: Or Gerlitz <gerlitz.or@gmail.com> Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
When a bio merges, we can get a request that spans multiple bios, and the overall request payload size is the sum of all bios. When we calculate how much we need to send from the existing bio (and bvec), we did not take into account the iov_iter byte count cap. Since multipage bvecs support, bvecs can split in the middle which means that when we account for the last bvec send we should also take the iov_iter byte count cap as it might be lower than the last bvec size. Reported-by: Hao Wang <pkuwangh@gmail.com> Fixes: 3f2304f ("nvme-tcp: add NVMe over TCP host driver") Tested-by: Hao Wang <pkuwangh@gmail.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
Discovery controllers usually don't support smart log page command. So when we connect to the discovery controller we see this warning: nvme nvme0: Failed to read smart log (error 24577) nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 192.168.123.1:8009 nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery" Introduce a new helper to understand if the controller is a discovery controller and use this helper to skip nvme_init_hwmon (also use it in other places that we check if the controller is a discovery controller). Fixes: 400b6a7 ("nvme: Add hardware monitoring support") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
…/git/netdev/net Pull networking fixes from Jakub Kicinski: "We have a few fixes for long standing issues, in particular Eric's fix to not underestimate the skb sizes, and my fix for brokenness of register_netdevice() error path. They may uncover other bugs so we will keep an eye on them. Also included are Willem's fixes for kmap(_atomic). Looking at the "current release" fixes, it seems we are about one rc behind a normal cycle. We've previously seen an uptick of "people had run their test suites" / "humans actually tried to use new features" fixes between rc2 and rc3. Summary: Current release - regressions: - fix feature enforcement to allow NETIF_F_HW_TLS_TX if IP_CSUM && IPV6_CSUM - dcb: accept RTM_GETDCB messages carrying set-like DCB commands if user is admin for backward-compatibility - selftests/tls: fix selftests build after adding ChaCha20-Poly1305 Current release - always broken: - ppp: fix refcount underflow on channel unbridge - bnxt_en: clear DEFRAG flag in firmware message when retry flashing - smc: fix out of bound access in the new netlink interface Previous releases - regressions: - fix use-after-free with UDP GRO by frags - mptcp: better msk-level shutdown - rndis_host: set proper input size for OID_GEN_PHYSICAL_MEDIUM request - i40e: xsk: fix potential NULL pointer dereferencing Previous releases - always broken: - skb frag: kmap_atomic fixes - avoid 32 x truesize under-estimation for tiny skbs - fix issues around register_netdevice() failures - udp: prevent reuseport_select_sock from reading uninitialized socks - dsa: unbind all switches from tree when DSA master unbinds - dsa: clear devlink port type before unregistering slave netdevs - can: isotp: isotp_getname(): fix kernel information leak - mlxsw: core: Thermal control fixes - ipv6: validate GSO SKB against MTU before finish IPv6 processing - stmmac: use __napi_schedule() for PREEMPT_RT - net: mvpp2: remove Pause and Asym_Pause support Misc: - remove from MAINTAINERS folks who had been inactive for >5yrs" * tag 'net-5.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (58 commits) mptcp: fix locking in mptcp_disconnect() net: Allow NETIF_F_HW_TLS_TX if IP_CSUM && IPV6_CSUM MAINTAINERS: dccp: move Gerrit Renker to CREDITS MAINTAINERS: ipvs: move Wensong Zhang to CREDITS MAINTAINERS: tls: move Aviad to CREDITS MAINTAINERS: ena: remove Zorik Machulsky from reviewers MAINTAINERS: vrf: move Shrijeet to CREDITS MAINTAINERS: net: move Alexey Kuznetsov to CREDITS MAINTAINERS: altx: move Jay Cliburn to CREDITS net: avoid 32 x truesize under-estimation for tiny skbs nt: usb: USB_RTL8153_ECM should not default to y net: stmmac: fix taprio configuration when base_time is in the past net: stmmac: fix taprio schedule configuration net: tip: fix a couple kernel-doc markups net: sit: unregister_netdevice on newlink's error path net: stmmac: Fixed mtu channged by cache aligned cxgb4/chtls: Fix tid stuck due to wrong update of qid i40e: fix potential NULL pointer dereferencing net: stmmac: use __napi_schedule() for PREEMPT_RT can: mcp251xfd: mcp251xfd_handle_rxif_one(): fix wrong NULL pointer check ...
…b/scm/linux/kernel/git/shuah/linux-kselftest Pull kselftest fixes from Shuah Khan: "One single fix to skip BPF selftests by default. BPF selftests have a hard dependency on cutting edge versions of tools in the BPF ecosystem including LLVM. Skipping BPF allows by default will make it easier for users interested in running kselftest as a whole. Users can include BPF in Kselftest build by via SKIP_TARGETS variable" * tag 'linux-kselftest-fixes-5.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: selftests: Skip BPF seftests by default
… block-5.11 Pull NVMe fixes from Christoph: "nvme fixes for 5.11: - don't initialize hwmon for discover controllers (Sagi Grimberg) - fix iov_iter handling in nvme-tcp (Sagi Grimberg) - fix a preempt warning in nvme-tcp (Sagi Grimberg) - fix a possible NULL pointer dereference in nvme (Israel Rukshin)" * tag 'nvme-5.11-2021-01-14' of git://git.infradead.org/nvme: nvme: don't intialize hwmon for discovery controllers nvme-tcp: fix possible data corruption with bio merges nvme-tcp: Fix warning with CONFIG_DEBUG_PREEMPT nvmet-rdma: Fix NULL deref when setting pi_enable and traddr INADDR_ANY
…g/drm/drm-misc into drm-fixes Short summary of fixes pull: * dma-buf: Fix a memory leak in CMAV heap * drm: Fix format check for legacy pageflips * ttm: Pass correct address to dma_mapping_error(); Use mutex in pool shrinker Signed-off-by: Dave Airlie <airlied@redhat.com> From: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/patch/msgid/X/2iXO4ofFSZ39/v@linux-uq9g
This issue has generally been covered up by the presence of additional expansion ROMs after the ones we're interested in, with header fetches of subsequent images loading enough of the ROM to hide the issue. Noticed on GA102, which lacks a type 0x70 image compared to TU102,. [ 906.364197] nouveau 0000:09:00.0: bios: 00000000: type 00, 65024 bytes [ 906.381205] nouveau 0000:09:00.0: bios: 0000fe00: type 03, 91648 bytes [ 906.405213] nouveau 0000:09:00.0: bios: 00026400: type e0, 22016 bytes [ 906.410984] nouveau 0000:09:00.0: bios: 0002ba00: type e0, 366080 bytes vs [ 22.961901] nouveau 0000:09:00.0: bios: 00000000: type 00, 60416 bytes [ 22.984174] nouveau 0000:09:00.0: bios: 0000ec00: type 03, 71168 bytes [ 23.010446] nouveau 0000:09:00.0: bios: 00020200: type e0, 48128 bytes [ 23.028220] nouveau 0000:09:00.0: bios: 0002be00: type e0, 140800 bytes [ 23.080196] nouveau 0000:09:00.0: bios: 0004e400: type 70, 7168 bytes Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Whatever it is that we were doing before doesn't work on Ampere. Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
No functional changes here yet. Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
RM does this around transactions, and it seemed to help while debugging AUXCH issues on GA102. Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Noticed while debugging GA102. Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
GA100 hidden behind a module option, as it's not been as well verified since initial bring-up and may need additional changes. There's no display anyway, so this can wait for a bit. Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
VRAM offset 0 is a valid address, triggered on GA102. Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Appears to be compatible with GP100 code. Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Forcing PRAMIN-shadowing off for GA100, as it requires display, and we don't know if/where the fuse register for detecting its presence is. Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
VPLL regs changed a bit. There's more stuff to do around these, but it's less invasive to stick those changes into disp for now. None of that belongs here anymore anyhow - fix that someday. Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Fortunately, all the interrupts we need to bring up basic display support are contained in a single leaf register, allowing this basic (but hackish) implementation. There's a bunch more invasive patches to come implementing all this in a better/more complete way, but trying to get a minimal series out first. Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Appears to be compatible with GM200 code. Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Appears to be compatible with NV50 code. Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
switch to devm_extcon_register_notifier_all
arm64: boot: dts: qcom: sdm632-motorola-ocean: Add backlight
Kiciuk
pushed a commit
that referenced
this pull request
Feb 15, 2021
The bit that indicates if the device supports 160MHZ is bit #9. The macro checks bit #8. Fix IWL_SUBDEVICE_NO_160 macro to use the correct bit. Signed-off-by: Matti Gottlieb <matti.gottlieb@intel.com> Fixes: d6f2134 ("iwlwifi: add mac/rf types and 160MHz to the device tables") Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Link: https://lore.kernel.org/r/iwlwifi.20210122144849.bddbf9b57a75.I16e09e2b1404b16bfff70852a5a654aa468579e2@changeid
Kiciuk
pushed a commit
that referenced
this pull request
May 15, 2021
I got several memory leak reports from Asan with a simple command. It was because VDSO is not released due to the refcount. Like in __dsos_addnew_id(), it should put the refcount after adding to the list. $ perf record true [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.030 MB perf.data (10 samples) ] ================================================================= ==692599==ERROR: LeakSanitizer: detected memory leaks Direct leak of 439 byte(s) in 1 object(s) allocated from: #0 0x7fea52341037 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154 #1 0x559bce4aa8ee in dso__new_id util/dso.c:1256 #2 0x559bce59245a in __machine__addnew_vdso util/vdso.c:132 #3 0x559bce59245a in machine__findnew_vdso util/vdso.c:347 #4 0x559bce50826c in map__new util/map.c:175 #5 0x559bce503c92 in machine__process_mmap2_event util/machine.c:1787 #6 0x559bce512f6b in machines__deliver_event util/session.c:1481 #7 0x559bce515107 in perf_session__deliver_event util/session.c:1551 #8 0x559bce51d4d2 in do_flush util/ordered-events.c:244 #9 0x559bce51d4d2 in __ordered_events__flush util/ordered-events.c:323 #10 0x559bce519bea in __perf_session__process_events util/session.c:2268 #11 0x559bce519bea in perf_session__process_events util/session.c:2297 msm8953-mainline#12 0x559bce2e7a52 in process_buildids /home/namhyung/project/linux/tools/perf/builtin-record.c:1017 msm8953-mainline#13 0x559bce2e7a52 in record__finish_output /home/namhyung/project/linux/tools/perf/builtin-record.c:1234 msm8953-mainline#14 0x559bce2ed4f6 in __cmd_record /home/namhyung/project/linux/tools/perf/builtin-record.c:2026 msm8953-mainline#15 0x559bce2ed4f6 in cmd_record /home/namhyung/project/linux/tools/perf/builtin-record.c:2858 msm8953-mainline#16 0x559bce422db4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313 msm8953-mainline#17 0x559bce2acac8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365 msm8953-mainline#18 0x559bce2acac8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409 msm8953-mainline#19 0x559bce2acac8 in main /home/namhyung/project/linux/tools/perf/perf.c:539 msm8953-mainline#20 0x7fea51e76d09 in __libc_start_main ../csu/libc-start.c:308 Indirect leak of 32 byte(s) in 1 object(s) allocated from: #0 0x7fea52341037 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154 #1 0x559bce520907 in nsinfo__copy util/namespaces.c:169 #2 0x559bce50821b in map__new util/map.c:168 #3 0x559bce503c92 in machine__process_mmap2_event util/machine.c:1787 #4 0x559bce512f6b in machines__deliver_event util/session.c:1481 #5 0x559bce515107 in perf_session__deliver_event util/session.c:1551 #6 0x559bce51d4d2 in do_flush util/ordered-events.c:244 #7 0x559bce51d4d2 in __ordered_events__flush util/ordered-events.c:323 #8 0x559bce519bea in __perf_session__process_events util/session.c:2268 #9 0x559bce519bea in perf_session__process_events util/session.c:2297 #10 0x559bce2e7a52 in process_buildids /home/namhyung/project/linux/tools/perf/builtin-record.c:1017 #11 0x559bce2e7a52 in record__finish_output /home/namhyung/project/linux/tools/perf/builtin-record.c:1234 msm8953-mainline#12 0x559bce2ed4f6 in __cmd_record /home/namhyung/project/linux/tools/perf/builtin-record.c:2026 msm8953-mainline#13 0x559bce2ed4f6 in cmd_record /home/namhyung/project/linux/tools/perf/builtin-record.c:2858 msm8953-mainline#14 0x559bce422db4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313 msm8953-mainline#15 0x559bce2acac8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365 msm8953-mainline#16 0x559bce2acac8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409 msm8953-mainline#17 0x559bce2acac8 in main /home/namhyung/project/linux/tools/perf/perf.c:539 msm8953-mainline#18 0x7fea51e76d09 in __libc_start_main ../csu/libc-start.c:308 SUMMARY: AddressSanitizer: 471 byte(s) leaked in 2 allocation(s). Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lore.kernel.org/lkml/20210315045641.700430-1-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Kiciuk
pushed a commit
that referenced
this pull request
May 15, 2021
The following deadlock is detected: truncate -> setattr path is waiting for pending direct IO to be done (inode->i_dio_count become zero) with inode->i_rwsem held (down_write). PID: 14827 TASK: ffff881686a9af80 CPU: 20 COMMAND: "ora_p005_hrltd9" #0 __schedule at ffffffff818667cc #1 schedule at ffffffff81866de6 #2 inode_dio_wait at ffffffff812a2d04 #3 ocfs2_setattr at ffffffffc05f322e [ocfs2] #4 notify_change at ffffffff812a5a09 #5 do_truncate at ffffffff812808f5 #6 do_sys_ftruncate.constprop.18 at ffffffff81280cf2 #7 sys_ftruncate at ffffffff81280d8e #8 do_syscall_64 at ffffffff81003949 #9 entry_SYSCALL_64_after_hwframe at ffffffff81a001ad dio completion path is going to complete one direct IO (decrement inode->i_dio_count), but before that it hung at locking inode->i_rwsem: #0 __schedule+700 at ffffffff818667cc #1 schedule+54 at ffffffff81866de6 #2 rwsem_down_write_failed+536 at ffffffff8186aa28 #3 call_rwsem_down_write_failed+23 at ffffffff8185a1b7 #4 down_write+45 at ffffffff81869c9d #5 ocfs2_dio_end_io_write+180 at ffffffffc05d5444 [ocfs2] #6 ocfs2_dio_end_io+85 at ffffffffc05d5a85 [ocfs2] #7 dio_complete+140 at ffffffff812c873c #8 dio_aio_complete_work+25 at ffffffff812c89f9 #9 process_one_work+361 at ffffffff810b1889 #10 worker_thread+77 at ffffffff810b233d #11 kthread+261 at ffffffff810b7fd5 msm8953-mainline#12 ret_from_fork+62 at ffffffff81a0035e Thus above forms ABBA deadlock. The same deadlock was mentioned in upstream commit 28f5a8a ("ocfs2: should wait dio before inode lock in ocfs2_setattr()"). It seems that that commit only removed the cluster lock (the victim of above dead lock) from the ABBA deadlock party. End-user visible effects: Process hang in truncate -> ocfs2_setattr path and other processes hang at ocfs2_dio_end_io_write path. This is to fix the deadlock itself. It removes inode_lock() call from dio completion path to remove the deadlock and add ip_alloc_sem lock in setattr path to synchronize the inode modifications. [wen.gang.wang@oracle.com: remove the "had_alloc_lock" as suggested] Link: https://lkml.kernel.org/r/20210402171344.1605-1-wen.gang.wang@oracle.com Link: https://lkml.kernel.org/r/20210331203654.3911-1-wen.gang.wang@oracle.com Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kiciuk
pushed a commit
that referenced
this pull request
May 22, 2021
commit 90bd070 upstream. The following deadlock is detected: truncate -> setattr path is waiting for pending direct IO to be done (inode->i_dio_count become zero) with inode->i_rwsem held (down_write). PID: 14827 TASK: ffff881686a9af80 CPU: 20 COMMAND: "ora_p005_hrltd9" #0 __schedule at ffffffff818667cc #1 schedule at ffffffff81866de6 #2 inode_dio_wait at ffffffff812a2d04 #3 ocfs2_setattr at ffffffffc05f322e [ocfs2] #4 notify_change at ffffffff812a5a09 #5 do_truncate at ffffffff812808f5 #6 do_sys_ftruncate.constprop.18 at ffffffff81280cf2 #7 sys_ftruncate at ffffffff81280d8e #8 do_syscall_64 at ffffffff81003949 #9 entry_SYSCALL_64_after_hwframe at ffffffff81a001ad dio completion path is going to complete one direct IO (decrement inode->i_dio_count), but before that it hung at locking inode->i_rwsem: #0 __schedule+700 at ffffffff818667cc #1 schedule+54 at ffffffff81866de6 #2 rwsem_down_write_failed+536 at ffffffff8186aa28 #3 call_rwsem_down_write_failed+23 at ffffffff8185a1b7 #4 down_write+45 at ffffffff81869c9d #5 ocfs2_dio_end_io_write+180 at ffffffffc05d5444 [ocfs2] #6 ocfs2_dio_end_io+85 at ffffffffc05d5a85 [ocfs2] #7 dio_complete+140 at ffffffff812c873c #8 dio_aio_complete_work+25 at ffffffff812c89f9 #9 process_one_work+361 at ffffffff810b1889 #10 worker_thread+77 at ffffffff810b233d #11 kthread+261 at ffffffff810b7fd5 msm8953-mainline#12 ret_from_fork+62 at ffffffff81a0035e Thus above forms ABBA deadlock. The same deadlock was mentioned in upstream commit 28f5a8a ("ocfs2: should wait dio before inode lock in ocfs2_setattr()"). It seems that that commit only removed the cluster lock (the victim of above dead lock) from the ABBA deadlock party. End-user visible effects: Process hang in truncate -> ocfs2_setattr path and other processes hang at ocfs2_dio_end_io_write path. This is to fix the deadlock itself. It removes inode_lock() call from dio completion path to remove the deadlock and add ip_alloc_sem lock in setattr path to synchronize the inode modifications. [wen.gang.wang@oracle.com: remove the "had_alloc_lock" as suggested] Link: https://lkml.kernel.org/r/20210402171344.1605-1-wen.gang.wang@oracle.com Link: https://lkml.kernel.org/r/20210331203654.3911-1-wen.gang.wang@oracle.com Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Kiciuk
pushed a commit
that referenced
this pull request
Aug 9, 2021
Loading and then unloading module g_dpgp on a VM that does not support the driver currently throws a WARN_ON message because the port has not been initialized. Removing an unused driver is a valid use-case and the WARN_ON kernel warning is a bit excessive, so remove it. Cleans up: [27654.638698] ------------[ cut here ]------------ [27654.638705] WARNING: CPU: 6 PID: 2956336 at drivers/usb/gadget/function/u_serial.c:1201 gserial_free_line+0x7c/0x90 [u_serial] [27654.638728] Modules linked in: g_dbgp(-) u_serial usb_f_tcm target_core_mod libcomposite udc_core vmw_vmci mcb i2c_nforce2 i2c_amd756 nfit cx8800 videobuf2_dma_sg videobuf2_memops videobuf2_v4l2 cx88xx tveeprom videobuf2_common videodev mc ccp hid_generic hid intel_ishtp cros_ec mc13xxx_core vfio_mdev mdev i915 i2c_algo_bit kvm ppdev parport zatm eni suni uPD98402 atm rio_scan binder_linux hwmon_vid video ipmi_devintf ipmi_msghandler zstd nls_utf8 decnet qrtr ns sctp ip6_udp_tunnel udp_tunnel fcrypt pcbc nhc_udp nhc_ipv6 nhc_routing nhc_mobility nhc_hop nhc_dest nhc_fragment 6lowpan ts_kmp dccp_ipv6 dccp_ipv4 dccp snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq_dummy snd_seq snd_seq_device xen_front_pgdir_shbuf binfmt_misc nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common snd_hda_codec_generic ledtrig_audio snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd rapl soundcore joydev input_leds mac_hid serio_raw efi_pstore [27654.638880] qemu_fw_cfg sch_fq_codel msr virtio_rng autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear qxl drm_ttm_helper crct10dif_pclmul ttm drm_kms_helper syscopyarea sysfillrect sysimgblt virtio_net fb_sys_fops cec net_failover rc_core ahci psmouse drm libahci lpc_ich virtio_blk failover [last unloaded: u_ether] [27654.638949] CPU: 6 PID: 2956336 Comm: modprobe Tainted: P O 5.13.0-9-generic #9 [27654.638956] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 [27654.638969] RIP: 0010:gserial_free_line+0x7c/0x90 [u_serial] [27654.638981] Code: 20 00 00 00 00 e8 74 1a ba c9 4c 89 e7 e8 8c fe ff ff 48 8b 3d 75 3b 00 00 44 89 f6 e8 3d 7c 69 c9 5b 41 5c 41 5d 41 5e 5d c3 <0f> 0b 4c 89 ef e8 4a 1a ba c9 5b 41 5c 41 5d 41 5e 5d c3 90 0f 1f [27654.638986] RSP: 0018:ffffba0b81403da0 EFLAGS: 00010246 [27654.638992] RAX: 0000000000000000 RBX: ffffffffc0eaf6a0 RCX: 0000000000000000 [27654.638996] RDX: ffff8e21c0cac8c0 RSI: 0000000000000006 RDI: ffffffffc0eaf6a0 [27654.639000] RBP: ffffba0b81403dc0 R08: ffffba0b81403de0 R09: fefefefefefefeff [27654.639003] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [27654.639006] R13: ffffffffc0eaf6a0 R14: 0000000000000000 R15: 0000000000000000 [27654.639010] FS: 00007faa1935e740(0000) GS:ffff8e223bd80000(0000) knlGS:0000000000000000 [27654.639015] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [27654.639019] CR2: 00007ffc840cd4e8 CR3: 000000000e1ac006 CR4: 0000000000370ee0 [27654.639028] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [27654.639031] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [27654.639035] Call Trace: [27654.639044] dbgp_exit+0x1c/0xa1a [g_dbgp] [27654.639054] __do_sys_delete_module.constprop.0+0x144/0x260 [27654.639066] ? call_rcu+0xe/0x10 [27654.639073] __x64_sys_delete_module+0x12/0x20 [27654.639081] do_syscall_64+0x61/0xb0 [27654.639092] ? exit_to_user_mode_loop+0xec/0x160 [27654.639098] ? exit_to_user_mode_prepare+0x37/0xb0 [27654.639104] ? syscall_exit_to_user_mode+0x27/0x50 [27654.639110] ? __x64_sys_close+0x12/0x40 [27654.639119] ? do_syscall_64+0x6e/0xb0 [27654.639126] ? exit_to_user_mode_prepare+0x37/0xb0 [27654.639132] ? syscall_exit_to_user_mode+0x27/0x50 [27654.639137] ? __x64_sys_newfstatat+0x1e/0x20 [27654.639146] ? do_syscall_64+0x6e/0xb0 [27654.639154] ? exc_page_fault+0x8f/0x170 [27654.639159] ? asm_exc_page_fault+0x8/0x30 [27654.639166] entry_SYSCALL_64_after_hwframe+0x44/0xae [27654.639173] RIP: 0033:0x7faa194a4b2b [27654.639179] Code: 73 01 c3 48 8b 0d 3d 73 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0d 73 0c 00 f7 d8 64 89 01 48 [27654.639185] RSP: 002b:00007ffc840d0578 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 [27654.639191] RAX: ffffffffffffffda RBX: 000056060f9f4e70 RCX: 00007faa194a4b2b [27654.639194] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000056060f9f4ed8 [27654.639197] RBP: 000056060f9f4e70 R08: 0000000000000000 R09: 0000000000000000 [27654.639200] R10: 00007faa1951eac0 R11: 0000000000000206 R12: 000056060f9f4ed8 [27654.639203] R13: 0000000000000000 R14: 000056060f9f4ed8 R15: 00007ffc840d06c8 [27654.639219] ---[ end trace 8dd0ea0bb32ce94a ]--- Signed-off-by: Colin Ian King <colin.king@canonical.com> Link: https://lore.kernel.org/r/20210701144305.110078-1-colin.king@canonical.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Kiciuk
pushed a commit
that referenced
this pull request
Aug 9, 2021
Enabling the framebuffer leads to a system hang. Running, as a debug hack, the store_pan() function in drivers/video/fbdev/core/fbsysfs.c without taking the console_lock, allows to see the crash backtrace on the serial line. ~ # echo 0 0 > /sys/class/graphics/fb0/pan [ 9.719414] Unhandled exception: IPSR = 00000005 LR = fffffff1 [ 9.726937] CPU: 0 PID: 49 Comm: sh Not tainted 5.13.0-rc5 #9 [ 9.733008] Hardware name: STM32 (Device Tree Support) [ 9.738296] PC is at clk_gate_is_enabled+0x0/0x28 [ 9.743426] LR is at stm32f4_pll_div_set_rate+0xf/0x38 [ 9.748857] pc : [<0011e4be>] lr : [<0011f9e3>] psr: 0100000b [ 9.755373] sp : 00bc7be0 ip : 00000000 fp : 001f3ac4 [ 9.760812] r10: 002610d0 r9 : 01efe920 r8 : 00540560 [ 9.766269] r7 : 02e7ddb0 r6 : 0173eed8 r5 : 00000000 r4 : 004027c0 [ 9.773081] r3 : 0011e4bf r2 : 02e7ddb0 r1 : 0173eed8 r0 : 1d3267b8 [ 9.779911] xPSR: 0100000b [ 9.782719] CPU: 0 PID: 49 Comm: sh Not tainted 5.13.0-rc5 #9 [ 9.788791] Hardware name: STM32 (Device Tree Support) [ 9.794120] [<0000afa1>] (unwind_backtrace) from [<0000a33f>] (show_stack+0xb/0xc) [ 9.802421] [<0000a33f>] (show_stack) from [<0000a8df>] (__invalid_entry+0x4b/0x4c) The `pll_num' field in the post_div_data configuration contained a wrong value which also referenced an uninitialized hardware clock when clk_register_pll_div() was called. Fixes: 517633e ("clk: stm32f4: Add post divisor for I2S & SAI PLLs") Signed-off-by: Dario Binacchi <dariobin@libero.it> Reviewed-by: Gabriel Fernandez <gabriel.fernandez@st.com> Link: https://lore.kernel.org/r/20210725160725.10788-1-dariobin@libero.it Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Kiciuk
pushed a commit
that referenced
this pull request
Oct 9, 2021
It's later supposed to be either a correct address or NULL. Without the initialization, it may contain an undefined value which results in the following segmentation fault: # perf top --sort comm -g --ignore-callees=do_idle terminates with: #0 0x00007ffff56b7685 in __strlen_avx2 () from /lib64/libc.so.6 #1 0x00007ffff55e3802 in strdup () from /lib64/libc.so.6 #2 0x00005555558cb139 in hist_entry__init (callchain_size=<optimized out>, sample_self=true, template=0x7fffde7fb110, he=0x7fffd801c250) at util/hist.c:489 #3 hist_entry__new (template=template@entry=0x7fffde7fb110, sample_self=sample_self@entry=true) at util/hist.c:564 #4 0x00005555558cb4ba in hists__findnew_entry (hists=hists@entry=0x5555561d9e38, entry=entry@entry=0x7fffde7fb110, al=al@entry=0x7fffde7fb420, sample_self=sample_self@entry=true) at util/hist.c:657 #5 0x00005555558cba1b in __hists__add_entry (hists=hists@entry=0x5555561d9e38, al=0x7fffde7fb420, sym_parent=<optimized out>, bi=bi@entry=0x0, mi=mi@entry=0x0, sample=sample@entry=0x7fffde7fb4b0, sample_self=true, ops=0x0, block_info=0x0) at util/hist.c:288 #6 0x00005555558cbb70 in hists__add_entry (sample_self=true, sample=0x7fffde7fb4b0, mi=0x0, bi=0x0, sym_parent=<optimized out>, al=<optimized out>, hists=0x5555561d9e38) at util/hist.c:1056 #7 iter_add_single_cumulative_entry (iter=0x7fffde7fb460, al=<optimized out>) at util/hist.c:1056 #8 0x00005555558cc8a4 in hist_entry_iter__add (iter=iter@entry=0x7fffde7fb460, al=al@entry=0x7fffde7fb420, max_stack_depth=<optimized out>, arg=arg@entry=0x7fffffff7db0) at util/hist.c:1231 #9 0x00005555557cdc9a in perf_event__process_sample (machine=<optimized out>, sample=0x7fffde7fb4b0, evsel=<optimized out>, event=<optimized out>, tool=0x7fffffff7db0) at builtin-top.c:842 #10 deliver_event (qe=<optimized out>, qevent=<optimized out>) at builtin-top.c:1202 #11 0x00005555558a9318 in do_flush (show_progress=false, oe=0x7fffffff80e0) at util/ordered-events.c:244 msm8953-mainline#12 __ordered_events__flush (oe=oe@entry=0x7fffffff80e0, how=how@entry=OE_FLUSH__TOP, timestamp=timestamp@entry=0) at util/ordered-events.c:323 msm8953-mainline#13 0x00005555558a9789 in __ordered_events__flush (timestamp=<optimized out>, how=<optimized out>, oe=<optimized out>) at util/ordered-events.c:339 msm8953-mainline#14 ordered_events__flush (how=OE_FLUSH__TOP, oe=0x7fffffff80e0) at util/ordered-events.c:341 msm8953-mainline#15 ordered_events__flush (oe=oe@entry=0x7fffffff80e0, how=how@entry=OE_FLUSH__TOP) at util/ordered-events.c:339 msm8953-mainline#16 0x00005555557cd631 in process_thread (arg=0x7fffffff7db0) at builtin-top.c:1114 msm8953-mainline#17 0x00007ffff7bb817a in start_thread () from /lib64/libpthread.so.0 msm8953-mainline#18 0x00007ffff5656dc3 in clone () from /lib64/libc.so.6 If you look at the frame #2, the code is: 488 if (he->srcline) { 489 he->srcline = strdup(he->srcline); 490 if (he->srcline == NULL) 491 goto err_rawdata; 492 } If he->srcline is not NULL (it is not NULL if it is uninitialized rubbish), it gets strdupped and strdupping a rubbish random string causes the problem. Also, if you look at the commit 1fb7d06, it adds the srcline property into the struct, but not initializing it everywhere needed. Committer notes: Now I see, when using --ignore-callees=do_idle we end up here at line 2189 in add_callchain_ip(): 2181 if (al.sym != NULL) { 2182 if (perf_hpp_list.parent && !*parent && 2183 symbol__match_regex(al.sym, &parent_regex)) 2184 *parent = al.sym; 2185 else if (have_ignore_callees && root_al && 2186 symbol__match_regex(al.sym, &ignore_callees_regex)) { 2187 /* Treat this symbol as the root, 2188 forgetting its callees. */ 2189 *root_al = al; 2190 callchain_cursor_reset(cursor); 2191 } 2192 } And the al that doesn't have the ->srcline field initialized will be copied to the root_al, so then, back to: 1211 int hist_entry_iter__add(struct hist_entry_iter *iter, struct addr_location *al, 1212 int max_stack_depth, void *arg) 1213 { 1214 int err, err2; 1215 struct map *alm = NULL; 1216 1217 if (al) 1218 alm = map__get(al->map); 1219 1220 err = sample__resolve_callchain(iter->sample, &callchain_cursor, &iter->parent, 1221 iter->evsel, al, max_stack_depth); 1222 if (err) { 1223 map__put(alm); 1224 return err; 1225 } 1226 1227 err = iter->ops->prepare_entry(iter, al); 1228 if (err) 1229 goto out; 1230 1231 err = iter->ops->add_single_entry(iter, al); 1232 if (err) 1233 goto out; 1234 That al at line 1221 is what hist_entry_iter__add() (called from sample__resolve_callchain()) saw as 'root_al', and then: iter->ops->add_single_entry(iter, al); will go on with al->srcline with a bogus value, I'll add the above sequence to the cset and apply, thanks! Signed-off-by: Michael Petlan <mpetlan@redhat.com> CC: Milian Wolff <milian.wolff@kdab.com> Cc: Jiri Olsa <jolsa@redhat.com> Fixes: 1fb7d06 ("perf report Use srcline from callchain for hist entries") Link: https //lore.kernel.org/r/20210719145332.29747-1-mpetlan@redhat.com Reported-by: Juri Lelli <jlelli@redhat.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Kiciuk
pushed a commit
that referenced
this pull request
Oct 9, 2021
FD uses xyarray__entry that may return NULL if an index is out of bounds. If NULL is returned then a segv happens as FD unconditionally dereferences the pointer. This was happening in a case of with perf iostat as shown below. The fix is to make FD an "int*" rather than an int and handle the NULL case as either invalid input or a closed fd. $ sudo gdb --args perf stat --iostat list ... Breakpoint 1, perf_evsel__alloc_fd (evsel=0x5555560951a0, ncpus=1, nthreads=1) at evsel.c:50 50 { (gdb) bt #0 perf_evsel__alloc_fd (evsel=0x5555560951a0, ncpus=1, nthreads=1) at evsel.c:50 #1 0x000055555585c188 in evsel__open_cpu (evsel=0x5555560951a0, cpus=0x555556093410, threads=0x555556086fb0, start_cpu=0, end_cpu=1) at util/evsel.c:1792 #2 0x000055555585cfb2 in evsel__open (evsel=0x5555560951a0, cpus=0x0, threads=0x555556086fb0) at util/evsel.c:2045 #3 0x000055555585d0db in evsel__open_per_thread (evsel=0x5555560951a0, threads=0x555556086fb0) at util/evsel.c:2065 #4 0x00005555558ece64 in create_perf_stat_counter (evsel=0x5555560951a0, config=0x555555c34700 <stat_config>, target=0x555555c2f1c0 <target>, cpu=0) at util/stat.c:590 #5 0x000055555578e927 in __run_perf_stat (argc=1, argv=0x7fffffffe4a0, run_idx=0) at builtin-stat.c:833 #6 0x000055555578f3c6 in run_perf_stat (argc=1, argv=0x7fffffffe4a0, run_idx=0) at builtin-stat.c:1048 #7 0x0000555555792ee5 in cmd_stat (argc=1, argv=0x7fffffffe4a0) at builtin-stat.c:2534 #8 0x0000555555835ed3 in run_builtin (p=0x555555c3f540 <commands+288>, argc=3, argv=0x7fffffffe4a0) at perf.c:313 #9 0x0000555555836154 in handle_internal_command (argc=3, argv=0x7fffffffe4a0) at perf.c:365 #10 0x000055555583629f in run_argv (argcp=0x7fffffffe2ec, argv=0x7fffffffe2e0) at perf.c:409 #11 0x0000555555836692 in main (argc=3, argv=0x7fffffffe4a0) at perf.c:539 ... (gdb) c Continuing. Error: The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (uncore_iio_0/event=0x83,umask=0x04,ch_mask=0xF,fc_mask=0x07/). /bin/dmesg | grep -i perf may provide additional information. Program received signal SIGSEGV, Segmentation fault. 0x00005555559b03ea in perf_evsel__close_fd_cpu (evsel=0x5555560951a0, cpu=1) at evsel.c:166 166 if (FD(evsel, cpu, thread) >= 0) v3. fixes a bug in perf_evsel__run_ioctl where the sense of a branch was backward. Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lore.kernel.org/lkml/20210918054440.2350466-1-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Kiciuk
pushed a commit
that referenced
this pull request
Feb 26, 2022
Patch series "mm: COW fixes part 1: fix the COW security issue for THP and swap", v3. This series attempts to optimize and streamline the COW logic for ordinary anon pages and THP anon pages, fixing two remaining instances of CVE-2020-29374 in do_swap_page() and do_huge_pmd_wp_page(): information can leak from a parent process to a child process via anonymous pages shared during fork(). This issue, including other related COW issues, has been summarized in [2]: " 1. Observing Memory Modifications of Private Pages From A Child Process Long story short: process-private memory might not be as private as you think once you fork(): successive modifications of private memory regions in the parent process can still be observed by the child process, for example, by smart use of vmsplice()+munmap(). The core problem is that pinning pages readable in a child process, such as done via the vmsplice system call, can result in a child process observing memory modifications done in the parent process the child is not supposed to observe. [1] contains an excellent summary and [2] contains further details. This issue was assigned CVE-2020-29374 [9]. For this to trigger, it's required to use a fork() without subsequent exec(), for example, as used under Android zygote. Without further details about an application that forks less-privileged child processes, one cannot really say what's actually affected and what's not -- see the details section the end of this mail for a short sshd/openssh analysis. While commit 1783985 ("gup: document and work around "COW can break either way" issue") fixed this issue and resulted in other problems (e.g., ptrace on pmem), commit 09854ba ("mm: do_wp_page() simplification") re-introduced part of the problem unfortunately. The original reproducer can be modified quite easily to use THP [3] and make the issue appear again on upstream kernels. I modified it to use hugetlb [4] and it triggers as well. The problem is certainly less severe with hugetlb than with THP; it merely highlights that we still have plenty of open holes we should be closing/fixing. Regarding vmsplice(), the only known workaround is to disallow the vmsplice() system call ... or disable THP and hugetlb. But who knows what else is affected (RDMA? O_DIRECT?) to achieve the same goal -- in the end, it's a more generic issue. " This security issue was first reported by Jann Horn on 27 May 2020 and it currently affects anonymous pages during swapin, anonymous THP and hugetlb. This series tackles anonymous pages during swapin and anonymous THP: * do_swap_page() for handling COW on PTEs during swapin directly * do_huge_pmd_wp_page() for handling COW on PMD-mapped THP during write faults With this series, we'll apply the same COW logic we have in do_wp_page() to all swappable anon pages: don't reuse (map writable) the page in case there are additional references (page_count() != 1). All users of reuse_swap_page() are remove, and consequently reuse_swap_page() is removed. In general, we're struggling with the following COW-related issues: (1) "missed COW": we miss to copy on write and reuse the page (map it writable) although we must copy because there are pending references from another process to this page. The result is a security issue. (2) "wrong COW": we copy on write although we wouldn't have to and shouldn't: if there are valid GUP references, they will become out of sync with the pages mapped into the page table. We fail to detect that such a page can be reused safely, especially if never more than a single process mapped the page. The result is an intra process memory corruption. (3) "unnecessary COW": we copy on write although we wouldn't have to: performance degradation and temporary increases swap+memory consumption can be the result. While this series fixes (1) for swappable anon pages, it tries to reduce reported cases of (3) first as good and easy as possible to limit the impact when streamlining. The individual patches try to describe in which cases we will run into (3). This series certainly makes (2) worse for THP, because a THP will now get PTE-mapped on write faults if there are additional references, even if there was only ever a single process involved: once PTE-mapped, we'll copy each and every subpage and won't reuse any subpage as long as the underlying compound page wasn't split. I'm working on an approach to fix (2) and improve (3): PageAnonExclusive to mark anon pages that are exclusive to a single process, allow GUP pins only on such exclusive pages, and allow turning exclusive pages shared (clearing PageAnonExclusive) only if there are no GUP pins. Anon pages with PageAnonExclusive set never have to be copied during write faults, but eventually during fork() if they cannot be turned shared. The improved reuse logic in this series will essentially also be the logic to reset PageAnonExclusive. This work will certainly take a while, but I'm planning on sharing details before having code fully ready. #1-#5 can be applied independently of the rest. #6-#9 are mostly only cleanups related to reuse_swap_page(). Notes: * For now, I'll leave hugetlb code untouched: "unnecessary COW" might easily break existing setups because hugetlb pages are a scarce resource and we could just end up having to crash the application when we run out of hugetlb pages. We have to be very careful and the security aspect with hugetlb is most certainly less relevant than for unprivileged anon pages. * Instead of lru_add_drain() we might actually just drain the lru_add list or even just remove the single page of interest from the lru_add list. This would require a new helper function, and could be added if the conditional lru_add_drain() turn out to be a problem. * I extended the test case already included in [1] to also test for the newly found do_swap_page() case. I'll send that out separately once/if this part was merged. [1] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com [2] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com This patch (of 9): Liang Zhang reported [1] that the current COW logic in do_wp_page() is sub-optimal when it comes to swap+read fault+write fault of anonymous pages that have a single user, visible via a performance degradation in the redis benchmark. Something similar was previously reported [2] by Nadav with a simple reproducer. After we put an anon page into the swapcache and unmapped it from a single process, that process might read that page again and refault it read-only. If that process then writes to that page, the process is actually the exclusive user of the page, however, the COW logic in do_co_page() won't be able to reuse it due to the additional reference from the swapcache. Let's optimize for pages that have been added to the swapcache but only have an exclusive user. Try removing the swapcache reference if there is hope that we're the exclusive user. We will fail removing the swapcache reference in two scenarios: (1) There are additional swap entries referencing the page: copying instead of reusing is the right thing to do. (2) The page is under writeback: theoretically we might be able to reuse in some cases, however, we cannot remove the additional reference and will have to copy. Note that we'll only try removing the page from the swapcache when it's highly likely that we'll be the exclusive owner after removing the page from the swapache. As we're about to map that page writable and redirty it, that should not affect reclaim but is rather the right thing to do. Further, we might have additional references from the LRU pagevecs, which will force us to copy instead of being able to reuse. We'll try handling such references for some scenarios next. Concurrent writeback cannot be handled easily and we'll always have to copy. While at it, remove the superfluous page_mapcount() check: it's implicitly covered by the page_count() for ordinary anon pages. [1] https://lkml.kernel.org/r/20220113140318.11117-1-zhangliang5@huawei.com [2] https://lkml.kernel.org/r/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com Link: https://lkml.kernel.org/r/20220131162940.210846-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Liang Zhang <zhangliang5@huawei.com> Reported-by: Nadav Amit <nadav.amit@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Jann Horn <jannh@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Don Dutile <ddutile@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Mark Brown <broonie@kernel.org>
Kiciuk
pushed a commit
that referenced
this pull request
Nov 26, 2022
Time stamps are added to the output in kernels built with CONFIG_PRINTK_TIME=y, which causes misaligned output. Therefore, replace pr_cont() with pr_err(), which fixes alignment and gets rid of a couple of despised pr_cont() calls. Before: [ 37.567343] rcu: INFO: rcu_preempt self-detected stall on CPU [ 37.567839] rcu: 0-....: (1500 ticks this GP) idle=*** [ 37.568270] (t=1501 jiffies g=4717 q=28 ncpus=4) [ 37.568668] CPU: 0 PID: 313 Comm: test0 Not tainted 6.1.0-rc4 #8 After: [ 36.762074] rcu: INFO: rcu_preempt self-detected stall on CPU [ 36.762543] rcu: 0-....: (1499 ticks this GP) idle=*** [ 36.763003] rcu: (t=1500 jiffies g=5097 q=27 ncpus=4) [ 36.763522] CPU: 0 PID: 313 Comm: test0 Not tainted 6.1.0-rc4 #9 Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Kiciuk
pushed a commit
that referenced
this pull request
Mar 13, 2023
It seems that commit bc3c5e0 ("drm/i915/sseu: Don't try to store EU mask internally in UAPI format") exposed a potential out-of-bounds access, reported by UBSAN as following on a laptop with a gen 11 i915 card: UBSAN: array-index-out-of-bounds in drivers/gpu/drm/i915/gt/intel_sseu.c:65:27 index 6 is out of range for type 'u16 [6]' CPU: 2 PID: 165 Comm: systemd-udevd Not tainted 6.2.0-9-generic #9-Ubuntu Hardware name: Dell Inc. XPS 13 9300/077Y9N, BIOS 1.11.0 03/22/2022 Call Trace: <TASK> show_stack+0x4e/0x61 dump_stack_lvl+0x4a/0x6f dump_stack+0x10/0x18 ubsan_epilogue+0x9/0x3a __ubsan_handle_out_of_bounds.cold+0x42/0x47 gen11_compute_sseu_info+0x121/0x130 [i915] intel_sseu_info_init+0x15d/0x2b0 [i915] intel_gt_init_mmio+0x23/0x40 [i915] i915_driver_mmio_probe+0x129/0x400 [i915] ? intel_gt_probe_all+0x91/0x2e0 [i915] i915_driver_probe+0xe1/0x3f0 [i915] ? drm_privacy_screen_get+0x16d/0x190 [drm] ? acpi_dev_found+0x64/0x80 i915_pci_probe+0xac/0x1b0 [i915] ... According to the definition of sseu_dev_info, eu_mask->hsw is limited to a maximum of GEN_MAX_SS_PER_HSW_SLICE (6) sub-slices, but gen11_sseu_info_init() can potentially set 8 sub-slices, in the !IS_JSL_EHL(gt->i915) case. Fix this by reserving up to 8 slots for max_subslices in the eu_mask struct. Reported-by: Emil Renner Berthing <emil.renner.berthing@canonical.com> Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Fixes: bc3c5e0 ("drm/i915/sseu: Don't try to store EU mask internally in UAPI format") Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230220171858.131416-1-andrea.righi@canonical.com (cherry picked from commit 3cba09a) Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Kiciuk
pushed a commit
that referenced
this pull request
Jul 24, 2023
Noticed with: make EXTRA_CFLAGS="-fsanitize=address" BUILD_BPF_SKEL=1 CORESIGHT=1 O=/tmp/build/perf-tools-next -C tools/perf install-bin Direct leak of 45 byte(s) in 1 object(s) allocated from: #0 0x7f213f87243b in strdup (/lib64/libasan.so.8+0x7243b) #1 0x63d15f in evsel__set_filter util/evsel.c:1371 #2 0x63d15f in evsel__append_filter util/evsel.c:1387 #3 0x63d15f in evsel__append_tp_filter util/evsel.c:1400 #4 0x62cd52 in evlist__append_tp_filter util/evlist.c:1145 #5 0x62cd52 in evlist__append_tp_filter_pids util/evlist.c:1196 #6 0x541e49 in trace__set_filter_loop_pids /home/acme/git/perf-tools/tools/perf/builtin-trace.c:3646 #7 0x541e49 in trace__set_filter_pids /home/acme/git/perf-tools/tools/perf/builtin-trace.c:3670 #8 0x541e49 in trace__run /home/acme/git/perf-tools/tools/perf/builtin-trace.c:3970 #9 0x541e49 in cmd_trace /home/acme/git/perf-tools/tools/perf/builtin-trace.c:5141 #10 0x5ef1a2 in run_builtin /home/acme/git/perf-tools/tools/perf/perf.c:323 #11 0x4196da in handle_internal_command /home/acme/git/perf-tools/tools/perf/perf.c:377 msm8953-mainline#12 0x4196da in run_argv /home/acme/git/perf-tools/tools/perf/perf.c:421 msm8953-mainline#13 0x4196da in main /home/acme/git/perf-tools/tools/perf/perf.c:537 msm8953-mainline#14 0x7f213e84a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f) Free it on evsel__exit(). Acked-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/lkml/20230719202951.534582-2-acme@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Kiciuk
pushed a commit
that referenced
this pull request
Jul 24, 2023
To plug these leaks detected with: $ make EXTRA_CFLAGS="-fsanitize=address" BUILD_BPF_SKEL=1 CORESIGHT=1 O=/tmp/build/perf-tools-next -C tools/perf install-bin ================================================================= ==473890==ERROR: LeakSanitizer: detected memory leaks Direct leak of 112 byte(s) in 1 object(s) allocated from: #0 0x7fdf19aba097 in calloc (/lib64/libasan.so.8+0xba097) #1 0x987836 in zalloc (/home/acme/bin/perf+0x987836) #2 0x5367ae in thread_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1289 #3 0x5367ae in thread__trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1307 #4 0x5367ae in trace__sys_exit /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:2468 #5 0x52bf34 in trace__handle_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3177 #6 0x52bf34 in __trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3685 #7 0x542927 in trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3712 #8 0x542927 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:4055 #9 0x542927 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5141 #10 0x5ef1a2 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323 #11 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377 msm8953-mainline#12 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421 msm8953-mainline#13 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537 msm8953-mainline#14 0x7fdf18a4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f) Direct leak of 2048 byte(s) in 1 object(s) allocated from: #0 0x7f788fcba6af in __interceptor_malloc (/lib64/libasan.so.8+0xba6af) #1 0x5337c0 in trace__sys_enter /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:2342 #2 0x52bfb4 in trace__handle_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3191 #3 0x52bfb4 in __trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3699 #4 0x542883 in trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3726 #5 0x542883 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:4069 #6 0x542883 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5155 #7 0x5ef232 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323 #8 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377 #9 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421 #10 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537 #11 0x7f788ec4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f) Indirect leak of 48 byte(s) in 1 object(s) allocated from: #0 0x7fdf19aba6af in __interceptor_malloc (/lib64/libasan.so.8+0xba6af) #1 0x77b335 in intlist__new util/intlist.c:116 #2 0x5367fd in thread_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1293 #3 0x5367fd in thread__trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1307 #4 0x5367fd in trace__sys_exit /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:2468 #5 0x52bf34 in trace__handle_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3177 #6 0x52bf34 in __trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3685 #7 0x542927 in trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3712 #8 0x542927 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:4055 #9 0x542927 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5141 #10 0x5ef1a2 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323 #11 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377 msm8953-mainline#12 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421 msm8953-mainline#13 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537 msm8953-mainline#14 0x7fdf18a4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f) Acked-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/lkml/20230719202951.534582-4-acme@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Kiciuk
pushed a commit
that referenced
this pull request
Jul 24, 2023
In 3cb4d5e ("perf trace: Free syscall tp fields in evsel->priv") it only was freeing if strcmp(evsel->tp_format->system, "syscalls") returned zero, while the corresponding initialization of evsel->priv was being performed if it was _not_ zero, i.e. if the tp system wasn't 'syscalls'. Just stop looking for that and free it if evsel->priv was set, which should be equivalent. Also use the pre-existing evsel_trace__delete() function. This resolves these leaks, detected with: $ make EXTRA_CFLAGS="-fsanitize=address" BUILD_BPF_SKEL=1 CORESIGHT=1 O=/tmp/build/perf-tools-next -C tools/perf install-bin ================================================================= ==481565==ERROR: LeakSanitizer: detected memory leaks Direct leak of 40 byte(s) in 1 object(s) allocated from: #0 0x7f7343cba097 in calloc (/lib64/libasan.so.8+0xba097) #1 0x987966 in zalloc (/home/acme/bin/perf+0x987966) #2 0x52f9b9 in evsel_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:307 #3 0x52f9b9 in evsel__syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:333 #4 0x52f9b9 in evsel__init_raw_syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:458 #5 0x52f9b9 in perf_evsel__raw_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:480 #6 0x540e8b in trace__add_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3212 #7 0x540e8b in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3891 #8 0x540e8b in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5156 #9 0x5ef262 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323 #10 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377 #11 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421 msm8953-mainline#12 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537 msm8953-mainline#13 0x7f7342c4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f) Direct leak of 40 byte(s) in 1 object(s) allocated from: #0 0x7f7343cba097 in calloc (/lib64/libasan.so.8+0xba097) #1 0x987966 in zalloc (/home/acme/bin/perf+0x987966) #2 0x52f9b9 in evsel_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:307 #3 0x52f9b9 in evsel__syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:333 #4 0x52f9b9 in evsel__init_raw_syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:458 #5 0x52f9b9 in perf_evsel__raw_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:480 #6 0x540dd1 in trace__add_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3205 #7 0x540dd1 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3891 #8 0x540dd1 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5156 #9 0x5ef262 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323 #10 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377 #11 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421 msm8953-mainline#12 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537 msm8953-mainline#13 0x7f7342c4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f) SUMMARY: AddressSanitizer: 80 byte(s) leaked in 2 allocation(s). [root@quaco ~]# With this we plug all leaks with "perf trace sleep 1". Fixes: 3cb4d5e ("perf trace: Free syscall tp fields in evsel->priv") Acked-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Riccardo Mancini <rickyman7@gmail.com> Link: https://lore.kernel.org/lkml/20230719202951.534582-5-acme@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Kiciuk
pushed a commit
that referenced
this pull request
Jul 24, 2023
Petr Machata says: ==================== mlxsw: Permit enslavement to netdevices with uppers The mlxsw driver currently makes the assumption that the user applies configuration in a bottom-up manner. Thus netdevices need to be added to the bridge before IP addresses are configured on that bridge or SVI added on top of it. Enslaving a netdevice to another netdevice that already has uppers is in fact forbidden by mlxsw for this reason. Despite this safety, it is rather easy to get into situations where the offloaded configuration is just plain wrong. As an example, take a front panel port, configure an IP address: it gets a RIF. Now enslave the port to the bridge, and the RIF is gone. Remove the port from the bridge again, but the RIF never comes back. There is a number of similar situations, where changing the configuration there and back utterly breaks the offload. Similarly, detaching a front panel port from a configured topology means unoffloading of this whole topology -- VLAN uppers, next hops, etc. Attaching the port back is then not permitted at all. If it were, it would not result in a working configuration, because much of mlxsw is written to react to changes in immediate configuration. There is nothing that would go visit netdevices in the attached-to topology and offload existing routes and VLAN memberships, for example. In this patchset, introduce a number of replays to be invoked so that this sort of post-hoc offload is supported. Then remove the vetoes that disallowed enslavement of front panel ports to other netdevices with uppers. The patchset progresses as follows: - In patch #1, fix an issue in the bridge driver. To my knowledge, the issue could not have resulted in a buggy behavior previously, and thus is packaged with this patchset instead of being sent separately to net. - In patch #2, add a new helper to the switchdev code. - In patch #3, drop mlxsw selftests that will not be relevant after this patchset anymore. - Patches #4, #5, #6, #7 and #8 prepare the codebase for smoother introduction of the rest of the code. - Patches #9, #10, #11, msm8953-mainline#12, msm8953-mainline#13 and msm8953-mainline#14 replay various aspects of upper configuration when a front panel port is introduced into a topology. Individual patches take care of bridge and LAG RIF memberships, switchdev replay, nexthop and neighbors replay, and MACVLAN offload. - Patches msm8953-mainline#15 and msm8953-mainline#16 introduce RIFs for newly-relevant netdevices when a front panel port is enslaved (in which case all uppers are newly relevant), or, respectively, deslaved (in which case the newly-relevant netdevice is the one being deslaved). - Up until this point, the introduced scaffolding was not really used, because mlxsw still forbids enslavement of mlxsw netdevices to uppers with uppers. In patch msm8953-mainline#17, this condition is finally relaxed. A sizable selftest suite is available to test all this new code. That will be sent in a separate patchset. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
Kiciuk
pushed a commit
that referenced
this pull request
Nov 4, 2023
Ido Schimmel says: ==================== Add MDB get support This patchset adds MDB get support, allowing user space to request a single MDB entry to be retrieved instead of dumping the entire MDB. Support is added in both the bridge and VXLAN drivers. Patches #1-#6 are small preparations in both drivers. Patches #7-#8 add the required uAPI attributes for the new functionality and the MDB get net device operation (NDO), respectively. Patches #9-#10 implement the MDB get NDO in both drivers. Patch #11 registers a handler for RTM_GETMDB messages in rtnetlink core. The handler derives the net device from the ifindex specified in the ancillary header and invokes its MDB get NDO. Patches msm8953-mainline#12-msm8953-mainline#13 add selftests by converting tests that use MDB dump with grep to the new MDB get functionality. iproute2 changes can be found here [1]. v2: * Patch #7: Add a comment to describe attributes structure. * Patch #9: Add a comment above spin_lock_bh(). [1] https://github.com/idosch/iproute2/tree/submit/mdb_get_v1 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
Kiciuk
pushed a commit
that referenced
this pull request
Dec 3, 2023
Petr Machata says: ==================== mlxsw: Support CFF flood mode The registers to configure to initialize a flood table differ between the controlled and CFF flood modes. In therefore needs to be an op. Add it, hook up the current init to the existing families, and invoke the op. PGT is an in-HW table that maps addresses to sets of ports. Then when some HW process needs a set of ports as an argument, instead of embedding the actual set in the dynamic configuration, what gets configured is the address referencing the set. The HW then works with the appropriate PGT entry. Among other allocations, the PGT currently contains two large blocks for bridge flooding: one for 802.1q and one for 802.1d. Within each of these blocks are three tables, for unknown-unicast, multicast and broadcast flooding: . . . | 802.1q | 802.1d | . . . | UC | MC | BC | UC | MC | BC | \______ _____/ \_____ ______/ v v FID flood vectors Thus each FID (which corresponds to an 802.1d bridge or one VLAN in an 802.1q bridge) uses three flood vectors spread across a fairly large region of PGT. This way of organizing the flood table (called "controlled") is not very flexible. E.g. to decrease a bridge scale and store more IP MC vectors, one would need to completely rewrite the bridge PGT blocks, or resort to hacks such as storing individual MC flood vectors into unused part of the bridge table. In order to address these shortcomings, Spectrum-2 and above support what is called CFF flood mode, for Compressed FID Flooding. In CFF flood mode, each FID has a little table of its own, with three entries adjacent to each other, one for unknown-UC, one for MC, one for BC. This allows for a much more fine-grained approach to PGT management, where bits of it are allocated on demand. . . . | FID | FID | FID | FID | FID | . . . |U|M|B|U|M|B|U|M|B|U|M|B|U|M|B| \_____________ _____________/ v FID flood vectors Besides the FID table organization, the CFF flood mode also impacts Router Subport (RSP) table. This table contains flood vectors for rFIDs, which are FIDs that reference front panel ports or LAGs. The RSP table contains two entries per front panel port and LAG, one for unknown-UC traffic, and one for everything else. Currently, the FW allocates and manages the table in its own part of PGT. rFIDs are marked with flood_rsp bit and managed specially. In CFF mode, rFIDs are managed as all other FIDs. The driver therefore has to allocate and maintain the flood vectors. Like with bridge FIDs, this is more work, but increases flexibility of the system. The FW currently supports both the controlled and CFF flood modes. To shed complexity, in the future it should only support CFF flood mode. Hence this patchset, which adds CFF flood mode support to mlxsw. Since mlxsw needs to maintain both the controlled mode as well as CFF mode support, we will keep the layout as compatible as possible. The bridge tables will stay in the same overall shape, just their inner organization will change from flood mode -> FID to FID -> flood mode. Likewise will RSP be kept as a contiguous block of PGT memory, as was the case when the FW maintained it. - The way FIDs get configured under the CFF flood mode differs from the currently used controlled mode. The simple approach of having several globally visible arrays for spectrum.c to statically choose from no longer works. Patch #1 thus privatizes all FID initialization and finalization logic, and exposes it as ops instead. - Patch #2 renames the ops that are specific to the controlled mode, to make room in the namespace for the CFF variants. Patch #3 extracts a helper to compute flood table base out of mlxsw_sp_fid_flood_table_mid(). - The op fid_setup configured fid_offset, i.e. the number of this FID within its family. For rFIDs in CFF mode, to determine this number, the driver will need to do fallible queries. Thus in patch #4, make the FID setup operation fallible as well. - Flood mode initialization routine differs between the controlled and CFF flood modes. The controlled mode needs to configure flood table layout, which the CFF mode does not need to do. In patch #5, move mlxsw_sp_fid_flood_table_init() up so that the following patch can make use of it. In patch #6, add an op to be invoked per table (if defined). - The current way of determining PGT allocation size depends on the number of FIDs and number of flood tables. RFIDs however have PGT footprint depending not on number of FIDs, but on number of ports and LAGs, because which ports an rFID should flood to does not depend on the FID itself, but on the port or LAG that it references. Therefore in patch #7, add FID family ops for determining PGT allocation size. - As elaborated above, layout of PGT will differ between controlled and CFF flood modes. In CFF mode, it will further differ between rFIDs and other FIDs (as described at previous patch). The way to pack the SFMR register to configure a FID will likewise differ from controlled to CFF. Thus in patches #8 and #9 add FID family ops to determine PGT base address for a FID and to pack SFMR. - Patches #10 and #11 add more bits for RSP support. In patch #10, add a new traffic type enumerator, for non-UC traffic. This is a combination of BC and MC traffic, but the way that mlxsw maps these mnemonic names to actual traffic type configurations requires that we have a new name to describe this class of traffic. Patch #11 then adds hooks necessary for RSP table maintenance. As ports come and go, and join and leave LAGs, it is necessary to update flood vectors that the rFIDs use. These new hooks will make that possible. - Patches msm8953-mainline#12, msm8953-mainline#13 and msm8953-mainline#14 introduce flood profiles. These have been implicit so far, but the way that CFF flood mode works with profile IDs requires that we make them explicit. Thus in patch msm8953-mainline#12, introduce flood profile objects as a set of flood tables that FID families then refer to. The FID code currently only uses a single flood profile. In patch msm8953-mainline#13, add a flood profile ID to flood profile objects. In patch msm8953-mainline#14, when in CFF mode, configure SFFP according to the existing flood profiles (or the one that exists as of that point). - Patches msm8953-mainline#15 and msm8953-mainline#16 add code to implement, respectively, bridge FIDs and RSP FIDs in CFF mode. - In patch msm8953-mainline#17, toggle flood_mode_prefer_cff on Spectrum-2 and above, which makes the newly-added code live. ==================== Link: https://lore.kernel.org/r/cover.1701183891.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
No description provided.