Skip to content

Conversation

@haraldh
Copy link

@haraldh haraldh commented Sep 23, 2021

According to:

SEV Secure Nested Paging Firmware ABI Specification
Publication # 56860 Revision: 0.9
Issue Date: April 1, 2021
Page 68, Table 54. Layout of the CMDBUF_SNP_LAUNCH_START Structure

Signed-off-by: Harald Hoyer harald@profian.com

According to:

SEV Secure Nested Paging Firmware ABI Specification
Publication # 56860 Revision: 0.9
Issue Date: April 1, 2021
Page 68, Table 54. Layout of the CMDBUF_SNP_LAUNCH_START Structure

Signed-off-by: Harald Hoyer <harald@profian.com>
@haraldh
Copy link
Author

haraldh commented Sep 23, 2021

meh, it's __packed anyway

@haraldh haraldh closed this Sep 23, 2021
@haraldh
Copy link
Author

haraldh commented Sep 23, 2021

correct fix #3

codomania pushed a commit that referenced this pull request Feb 9, 2022
arm32 uses software to simulate the instruction replaced
by kprobe. some instructions may be simulated by constructing
assembly functions. therefore, before executing instruction
simulation, it is necessary to construct assembly function
execution environment in C language through binding registers.
after kasan is enabled, the register binding relationship will
be destroyed, resulting in instruction simulation errors and
causing kernel panic.

the kprobe emulate instruction function is distributed in three
files: actions-common.c actions-arm.c actions-thumb.c, so disable
KASAN when compiling these files.

for example, use kprobe insert on cap_capable+20 after kasan
enabled, the cap_capable assembly code is as follows:
<cap_capable>:
e92d47f0	push	{r4, r5, r6, r7, r8, r9, sl, lr}
e1a05000	mov	r5, r0
e280006c	add	r0, r0, torvalds#108    ; 0x6c
e1a04001	mov	r4, r1
e1a06002	mov	r6, r2
e59fa090	ldr	sl, [pc, torvalds#144]  ;
ebfc7bf8	bl	c03aa4b4 <__asan_load4>
e595706c	ldr	r7, [r5, torvalds#108]  ; 0x6c
e2859014	add	r9, r5, torvalds#20
......
The emulate_ldr assembly code after enabling kasan is as follows:
c06f1384 <emulate_ldr>:
e92d47f0	push	{r4, r5, r6, r7, r8, r9, sl, lr}
e282803c	add	r8, r2, torvalds#60     ; 0x3c
e1a05000	mov	r5, r0
e7e37855	ubfx	r7, r5, torvalds#16, #4
e1a00008	mov	r0, r8
e1a09001	mov	r9, r1
e1a04002	mov	r4, r2
ebf35462	bl	c03c6530 <__asan_load4>
e357000f	cmp	r7, torvalds#15
e7e36655	ubfx	r6, r5, torvalds#12, #4
e205a00f	and	sl, r5, torvalds#15
0a000001	beq	c06f13bc <emulate_ldr+0x38>
e0840107	add	r0, r4, r7, lsl #2
ebf3545c	bl	c03c6530 <__asan_load4>
e084010a	add	r0, r4, sl, lsl #2
ebf3545a	bl	c03c6530 <__asan_load4>
e2890010	add	r0, r9, torvalds#16
ebf35458	bl	c03c6530 <__asan_load4>
e5990010	ldr	r0, [r9, torvalds#16]
e12fff30	blx	r0
e356000f	cm	r6, torvalds#15
1a000014	bne	c06f1430 <emulate_ldr+0xac>
e1a06000	mov	r6, r0
e2840040	add	r0, r4, torvalds#64     ; 0x40
......

when running in emulate_ldr to simulate the ldr instruction, panic
occurred, and the log is as follows:
Unable to handle kernel NULL pointer dereference at virtual address
00000090
pgd = ecb46400
[00000090] *pgd=2e0fa003, *pmd=00000000
Internal error: Oops: 206 [#1] SMP ARM
PC is at cap_capable+0x14/0xb0
LR is at emulate_ldr+0x50/0xc0
psr: 600d0293 sp : ecd63af8  ip : 00000004  fp : c0a7c30c
r10: 00000000  r9 : c30897f4  r8 : ecd63cd4
r7 : 0000000f  r6 : 0000000a  r5 : e59fa090  r4 : ecd63c98
r3 : c06ae294  r2 : 00000000  r1 : b7611300  r0 : bf4ec008
Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 32c5387d  Table: 2d546400  DAC: 55555555
Process bash (pid: 1643, stack limit = 0xecd60190)
(cap_capable) from (kprobe_handler+0x218/0x340)
(kprobe_handler) from (kprobe_trap_handler+0x24/0x48)
(kprobe_trap_handler) from (do_undefinstr+0x13c/0x364)
(do_undefinstr) from (__und_svc_finish+0x0/0x30)
(__und_svc_finish) from (cap_capable+0x18/0xb0)
(cap_capable) from (cap_vm_enough_memory+0x38/0x48)
(cap_vm_enough_memory) from
(security_vm_enough_memory_mm+0x48/0x6c)
(security_vm_enough_memory_mm) from
(copy_process.constprop.5+0x16b4/0x25c8)
(copy_process.constprop.5) from (_do_fork+0xe8/0x55c)
(_do_fork) from (SyS_clone+0x1c/0x24)
(SyS_clone) from (__sys_trace_return+0x0/0x10)
Code: 0050a0e1 6c0080e2 0140a0e1 0260a0e1 (f801f0e7)

Fixes: 35aa1df ("ARM kprobes: instruction single-stepping support")
Fixes: 4210157 ("ARM: 9017/2: Enable KASan for ARM")
Signed-off-by: huangshaobo <huangshaobo6@huawei.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
codomania pushed a commit that referenced this pull request Feb 9, 2022
Change the cifs filesystem to take account of the changes to fscache's
indexing rewrite and reenable caching in cifs.

The following changes have been made:

 (1) The fscache_netfs struct is no more, and there's no need to register
     the filesystem as a whole.

 (2) The session cookie is now an fscache_volume cookie, allocated with
     fscache_acquire_volume().  That takes three parameters: a string
     representing the "volume" in the index, a string naming the cache to
     use (or NULL) and a u64 that conveys coherency metadata for the
     volume.

     For cifs, I've made it render the volume name string as:

	"cifs,<ipaddress>,<sharename>"

     where the sharename has '/' characters replaced with ';'.

     This probably needs rethinking a bit as the total name could exceed
     the maximum filename component length.

     Further, the coherency data is currently just set to 0.  It needs
     something else doing with it - I wonder if it would suffice simply to
     sum the resource_id, vol_create_time and vol_serial_number or maybe
     hash them.

 (3) The fscache_cookie_def is no more and needed information is passed
     directly to fscache_acquire_cookie().  The cache no longer calls back
     into the filesystem, but rather metadata changes are indicated at
     other times.

     fscache_acquire_cookie() is passed the same keying and coherency
     information as before.

 (4) The functions to set/reset cookies are removed and
     fscache_use_cookie() and fscache_unuse_cookie() are used instead.

     fscache_use_cookie() is passed a flag to indicate if the cookie is
     opened for writing.  fscache_unuse_cookie() is passed updates for the
     metadata if we changed it (ie. if the file was opened for writing).

     These are called when the file is opened or closed.

 (5) cifs_setattr_*() are made to call fscache_resize() to change the size
     of the cache object.

 (6) The functions to read and write data are stubbed out pending a
     conversion to use netfslib.

Changes
=======
ver torvalds#8:
 - Abstract cache invalidation into a helper function.
 - Fix some checkpatch warnings[3].

ver torvalds#7:
 - Removed the accidentally added-back call to get the super cookie in
   cifs_root_iget().
 - Fixed the right call to cifs_fscache_get_super_cookie() to take account
   of the "-o fsc" mount flag.

ver #6:
 - Moved the change of gfpflags_allow_blocking() to current_is_kswapd() for
   cifs here.
 - Fixed one of the error paths in cifs_atomic_open() to jump around the
   call to use the cookie.
 - Fixed an additional successful return in the middle of cifs_open() to
   use the cookie on the way out.
 - Only get a volume cookie (and thus inode cookies) when "-o fsc" is
   supplied to mount.

ver #5:
 - Fixed a couple of bits of cookie handling[2]:
   - The cookie should be released in cifs_evict_inode(), not
     cifsFileInfo_put_final().  The cookie needs to persist beyond file
     closure so that writepages will be able to write to it.
   - fscache_use_cookie() needs to be called in cifs_atomic_open() as it is
     for cifs_open().

ver #4:
 - Fixed the use of sizeof with memset.
 - tcon->vol_create_time is __le64 so doesn't need cpu_to_le64().

ver #3:
 - Canonicalise the cifs coherency data to make the cache portable.
 - Set volume coherency data.

ver #2:
 - Use gfpflags_allow_blocking() rather than using flag directly.
 - Upgraded to -rc4 to allow for upstream changes[1].
 - fscache_acquire_volume() now returns errors.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
cc: Steve French <smfrench@gmail.com>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: linux-cifs@vger.kernel.org
cc: linux-cachefs@redhat.com
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=23b55d673d7527b093cd97b7c217c82e70cd1af0 [1]
Link: https://lore.kernel.org/r/3419813.1641592362@warthog.procyon.org.uk/ [2]
Link: https://lore.kernel.org/r/CAH2r5muTanw9pJqzAHd01d9A8keeChkzGsCEH6=0rHutVLAF-A@mail.gmail.com/ [3]
Link: https://lore.kernel.org/r/163819671009.215744.11230627184193298714.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906982979.143852.10672081929614953210.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967187187.1823006.247415138444991444.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021579335.640689.2681324337038770579.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/3462849.1641593783@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/1318953.1642024578@warthog.procyon.org.uk/ # v6
Signed-off-by: Steve French <stfrench@microsoft.com>
codomania pushed a commit that referenced this pull request Feb 9, 2022
…y if PMI is pending

Running selftest with CONFIG_PPC_IRQ_SOFT_MASK_DEBUG enabled in kernel
triggered below warning:

[  172.851380] ------------[ cut here ]------------
[  172.851391] WARNING: CPU: 8 PID: 2901 at arch/powerpc/include/asm/hw_irq.h:246 power_pmu_disable+0x270/0x280
[  172.851402] Modules linked in: dm_mod bonding nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables rfkill nfnetlink sunrpc xfs libcrc32c pseries_rng xts vmx_crypto uio_pdrv_genirq uio sch_fq_codel ip_tables ext4 mbcache jbd2 sd_mod t10_pi sg ibmvscsi ibmveth scsi_transport_srp fuse
[  172.851442] CPU: 8 PID: 2901 Comm: lost_exception_ Not tainted 5.16.0-rc5-03218-g798527287598 #2
[  172.851451] NIP:  c00000000013d600 LR: c00000000013d5a4 CTR: c00000000013b180
[  172.851458] REGS: c000000017687860 TRAP: 0700   Not tainted  (5.16.0-rc5-03218-g798527287598)
[  172.851465] MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 48004884  XER: 20040000
[  172.851482] CFAR: c00000000013d5b4 IRQMASK: 1
[  172.851482] GPR00: c00000000013d5a4 c000000017687b00 c000000002a10600 0000000000000004
[  172.851482] GPR04: 0000000082004000 c0000008ba08f0a8 0000000000000000 00000008b7ed0000
[  172.851482] GPR08: 00000000446194f6 0000000000008000 c00000000013b118 c000000000d58e68
[  172.851482] GPR12: c00000000013d390 c00000001ec54a80 0000000000000000 0000000000000000
[  172.851482] GPR16: 0000000000000000 0000000000000000 c000000015d5c708 c0000000025396d0
[  172.851482] GPR20: 0000000000000000 0000000000000000 c00000000a3bbf40 0000000000000003
[  172.851482] GPR24: 0000000000000000 c0000008ba097400 c0000000161e0d00 c00000000a3bb600
[  172.851482] GPR28: c000000015d5c700 0000000000000001 0000000082384090 c0000008ba0020d8
[  172.851549] NIP [c00000000013d600] power_pmu_disable+0x270/0x280
[  172.851557] LR [c00000000013d5a4] power_pmu_disable+0x214/0x280
[  172.851565] Call Trace:
[  172.851568] [c000000017687b00] [c00000000013d5a4] power_pmu_disable+0x214/0x280 (unreliable)
[  172.851579] [c000000017687b40] [c0000000003403ac] perf_pmu_disable+0x4c/0x60
[  172.851588] [c000000017687b60] [c0000000003445e4] __perf_event_task_sched_out+0x1d4/0x660
[  172.851596] [c000000017687c50] [c000000000d1175c] __schedule+0xbcc/0x12a0
[  172.851602] [c000000017687d60] [c000000000d11ea8] schedule+0x78/0x140
[  172.851608] [c000000017687d90] [c0000000001a8080] sys_sched_yield+0x20/0x40
[  172.851615] [c000000017687db0] [c0000000000334dc] system_call_exception+0x18c/0x380
[  172.851622] [c000000017687e10] [c00000000000c74c] system_call_common+0xec/0x268

The warning indicates that MSR_EE being set(interrupt enabled) when
there was an overflown PMC detected. This could happen in
power_pmu_disable since it runs under interrupt soft disable
condition ( local_irq_save ) and not with interrupts hard disabled.
commit 2c9ac51 ("powerpc/perf: Fix PMU callbacks to clear
pending PMI before resetting an overflown PMC") intended to clear
PMI pending bit in Paca when disabling the PMU. It could happen
that PMC gets overflown while code is in power_pmu_disable
callback function. Hence add a check to see if PMI pending bit
is set in Paca before clearing it via clear_pmi_pending.

Fixes: 2c9ac51 ("powerpc/perf: Fix PMU callbacks to clear pending PMI before resetting an overflown PMC")
Reported-by: Sachin Sant <sachinp@linux.ibm.com>
Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220122033429.25395-1-atrajeev@linux.vnet.ibm.com
codomania pushed a commit that referenced this pull request Feb 9, 2022
… devices

Suppose we have an environment with a number of non-NPIV FCP devices
(virtual HBAs / FCP devices / zfcp "adapter"s) sharing the same physical
FCP channel (HBA port) and its I_T nexus. Plus a number of storage target
ports zoned to such shared channel. Now one target port logs out of the
fabric causing an RSCN. Zfcp reacts with an ADISC ELS and subsequent port
recovery depending on the ADISC result. This happens on all such FCP
devices (in different Linux images) concurrently as they all receive a copy
of this RSCN. In the following we look at one of those FCP devices.

Requests other than FSF_QTCB_FCP_CMND can be slow until they get a
response.

Depending on which requests are affected by slow responses, there are
different recovery outcomes. Here we want to fix failed recoveries on port
or adapter level by avoiding recovery requests that can be slow.

We need the cached N_Port_ID for the remote port "link" test with ADISC.
Just before sending the ADISC, we now intentionally forget the old cached
N_Port_ID. The idea is that on receiving an RSCN for a port, we have to
assume that any cached information about this port is stale.  This forces a
fresh new GID_PN [FC-GS] nameserver lookup on any subsequent recovery for
the same port. Since we typically can still communicate with the nameserver
efficiently, we now reach steady state quicker: Either the nameserver still
does not know about the port so we stop recovery, or the nameserver already
knows the port potentially with a new N_Port_ID and we can successfully and
quickly perform open port recovery.  For the one case, where ADISC returns
successfully, we re-initialize port->d_id because that case does not
involve any port recovery.

This also solves a problem if the storage WWPN quickly logs into the fabric
again but with a different N_Port_ID. Such as on virtual WWPN takeover
during target NPIV failover.
[https://www.redbooks.ibm.com/abstracts/redp5477.html] In that case the
RSCN from the storage FDISC was ignored by zfcp and we could not
successfully recover the failover. On some later failback on the storage,
we could have been lucky if the virtual WWPN got the same old N_Port_ID
from the SAN switch as we still had cached.  Then the related RSCN
triggered a successful port reopen recovery.  However, there is no
guarantee to get the same N_Port_ID on NPIV FDISC.

Even though NPIV-enabled FCP devices are not affected by this problem, this
code change optimizes recovery time for gone remote ports as a side effect.
The timely drop of cached N_Port_IDs prevents unnecessary slow open port
attempts.

While the problem might have been in code before v2.6.32 commit
799b76d ("[SCSI] zfcp: Decouple gid_pn requests from erp") this fix
depends on the gid_pn_work introduced with that commit, so we mark it as
culprit to satisfy fix dependencies.

Note: Point-to-point remote port is already handled separately and gets its
N_Port_ID from the cached peer_d_id. So resetting port->d_id in general
does not affect PtP.

Link: https://lore.kernel.org/r/20220118165803.3667947-1-maier@linux.ibm.com
Fixes: 799b76d ("[SCSI] zfcp: Decouple gid_pn requests from erp")
Cc: <stable@vger.kernel.org> #2.6.32+
Suggested-by: Benjamin Block <bblock@linux.ibm.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
Signed-off-by: Steffen Maier <maier@linux.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
codomania pushed a commit that referenced this pull request Feb 9, 2022
Add a test that sends large udp packet (which is fragmented)
via a stateless nft nat rule, i.e. 'ip saddr set 10.2.3.4'
and check that the datagram is received by peer.

On kernels without
commit 4e1860a ("netfilter: nft_payload: do not update layer 4 checksum when mangling fragments")',
this will fail with:

cmp: EOF on /tmp/tmp.V1q0iXJyQF which is empty
-rw------- 1 root root 4096 Jan 24 22:03 /tmp/tmp.Aaqnq4rBKS
-rw------- 1 root root    0 Jan 24 22:03 /tmp/tmp.V1q0iXJyQF
ERROR: in and output file mismatch when checking udp with stateless nat
FAIL: nftables v1.0.0 (Fearless Fosdick #2)

On patched kernels, this will show:
PASS: IP statless for ns2-PFp89amx

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
codomania pushed a commit that referenced this pull request Feb 9, 2022
Fix the following false positive warning:
 =============================
 WARNING: suspicious RCU usage
 5.16.0-rc4+ torvalds#57 Not tainted
 -----------------------------
 arch/x86/kvm/../../../virt/kvm/eventfd.c:484 RCU-list traversed in non-reader section!!

 other info that might help us debug this:

 rcu_scheduler_active = 2, debug_locks = 1
 3 locks held by fc_vcpu 0/330:
  #0: ffff8884835fc0b0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x88/0x6f0 [kvm]
  #1: ffffc90004c0bb68 (&kvm->srcu){....}-{0:0}, at: vcpu_enter_guest+0x600/0x1860 [kvm]
  #2: ffffc90004c0c1d0 (&kvm->irq_srcu){....}-{0:0}, at: kvm_notify_acked_irq+0x36/0x180 [kvm]

 stack backtrace:
 CPU: 26 PID: 330 Comm: fc_vcpu 0 Not tainted 5.16.0-rc4+
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x44/0x57
  kvm_notify_acked_gsi+0x6b/0x70 [kvm]
  kvm_notify_acked_irq+0x8d/0x180 [kvm]
  kvm_ioapic_update_eoi+0x92/0x240 [kvm]
  kvm_apic_set_eoi_accelerated+0x2a/0xe0 [kvm]
  handle_apic_eoi_induced+0x3d/0x60 [kvm_intel]
  vmx_handle_exit+0x19c/0x6a0 [kvm_intel]
  vcpu_enter_guest+0x66e/0x1860 [kvm]
  kvm_arch_vcpu_ioctl_run+0x438/0x7f0 [kvm]
  kvm_vcpu_ioctl+0x38a/0x6f0 [kvm]
  __x64_sys_ioctl+0x89/0xc0
  do_syscall_64+0x3a/0x90
  entry_SYSCALL_64_after_hwframe+0x44/0xae

Since kvm_unregister_irq_ack_notifier() does synchronize_srcu(&kvm->irq_srcu),
kvm->irq_ack_notifier_list is protected by kvm->irq_srcu. In fact,
kvm->irq_srcu SRCU read lock is held in kvm_notify_acked_irq(), making it
a false positive warning. So use hlist_for_each_entry_srcu() instead of
hlist_for_each_entry_rcu().

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
Message-Id: <f98bac4f5052bad2c26df9ad50f7019e40434512.1643265976.git.houwenlong.hwl@antgroup.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
codomania pushed a commit that referenced this pull request Feb 9, 2022
Quota disable ioctl starts a transaction before waiting for the qgroup
rescan worker completes. However, this wait can be infinite and results
in deadlock because of circular dependency among the quota disable
ioctl, the qgroup rescan worker and the other task with transaction such
as block group relocation task.

The deadlock happens with the steps following:

1) Task A calls ioctl to disable quota. It starts a transaction and
   waits for qgroup rescan worker completes.
2) Task B such as block group relocation task starts a transaction and
   joins to the transaction that task A started. Then task B commits to
   the transaction. In this commit, task B waits for a commit by task A.
3) Task C as the qgroup rescan worker starts its job and starts a
   transaction. In this transaction start, task C waits for completion
   of the transaction that task A started and task B committed.

This deadlock was found with fstests test case btrfs/115 and a zoned
null_blk device. The test case enables and disables quota, and the
block group reclaim was triggered during the quota disable by chance.
The deadlock was also observed by running quota enable and disable in
parallel with 'btrfs balance' command on regular null_blk devices.

An example report of the deadlock:

  [372.469894] INFO: task kworker/u16:6:103 blocked for more than 122 seconds.
  [372.479944]       Not tainted 5.16.0-rc8 torvalds#7
  [372.485067] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [372.493898] task:kworker/u16:6   state:D stack:    0 pid:  103 ppid:     2 flags:0x00004000
  [372.503285] Workqueue: btrfs-qgroup-rescan btrfs_work_helper [btrfs]
  [372.510782] Call Trace:
  [372.514092]  <TASK>
  [372.521684]  __schedule+0xb56/0x4850
  [372.530104]  ? io_schedule_timeout+0x190/0x190
  [372.538842]  ? lockdep_hardirqs_on+0x7e/0x100
  [372.547092]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
  [372.555591]  schedule+0xe0/0x270
  [372.561894]  btrfs_commit_transaction+0x18bb/0x2610 [btrfs]
  [372.570506]  ? btrfs_apply_pending_changes+0x50/0x50 [btrfs]
  [372.578875]  ? free_unref_page+0x3f2/0x650
  [372.585484]  ? finish_wait+0x270/0x270
  [372.591594]  ? release_extent_buffer+0x224/0x420 [btrfs]
  [372.599264]  btrfs_qgroup_rescan_worker+0xc13/0x10c0 [btrfs]
  [372.607157]  ? lock_release+0x3a9/0x6d0
  [372.613054]  ? btrfs_qgroup_account_extent+0xda0/0xda0 [btrfs]
  [372.620960]  ? do_raw_spin_lock+0x11e/0x250
  [372.627137]  ? rwlock_bug.part.0+0x90/0x90
  [372.633215]  ? lock_is_held_type+0xe4/0x140
  [372.639404]  btrfs_work_helper+0x1ae/0xa90 [btrfs]
  [372.646268]  process_one_work+0x7e9/0x1320
  [372.652321]  ? lock_release+0x6d0/0x6d0
  [372.658081]  ? pwq_dec_nr_in_flight+0x230/0x230
  [372.664513]  ? rwlock_bug.part.0+0x90/0x90
  [372.670529]  worker_thread+0x59e/0xf90
  [372.676172]  ? process_one_work+0x1320/0x1320
  [372.682440]  kthread+0x3b9/0x490
  [372.687550]  ? _raw_spin_unlock_irq+0x24/0x50
  [372.693811]  ? set_kthread_struct+0x100/0x100
  [372.700052]  ret_from_fork+0x22/0x30
  [372.705517]  </TASK>
  [372.709747] INFO: task btrfs-transacti:2347 blocked for more than 123 seconds.
  [372.729827]       Not tainted 5.16.0-rc8 torvalds#7
  [372.745907] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [372.767106] task:btrfs-transacti state:D stack:    0 pid: 2347 ppid:     2 flags:0x00004000
  [372.787776] Call Trace:
  [372.801652]  <TASK>
  [372.812961]  __schedule+0xb56/0x4850
  [372.830011]  ? io_schedule_timeout+0x190/0x190
  [372.852547]  ? lockdep_hardirqs_on+0x7e/0x100
  [372.871761]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
  [372.886792]  schedule+0xe0/0x270
  [372.901685]  wait_current_trans+0x22c/0x310 [btrfs]
  [372.919743]  ? btrfs_put_transaction+0x3d0/0x3d0 [btrfs]
  [372.938923]  ? finish_wait+0x270/0x270
  [372.959085]  ? join_transaction+0xc75/0xe30 [btrfs]
  [372.977706]  start_transaction+0x938/0x10a0 [btrfs]
  [372.997168]  transaction_kthread+0x19d/0x3c0 [btrfs]
  [373.013021]  ? btrfs_cleanup_transaction.isra.0+0xfc0/0xfc0 [btrfs]
  [373.031678]  kthread+0x3b9/0x490
  [373.047420]  ? _raw_spin_unlock_irq+0x24/0x50
  [373.064645]  ? set_kthread_struct+0x100/0x100
  [373.078571]  ret_from_fork+0x22/0x30
  [373.091197]  </TASK>
  [373.105611] INFO: task btrfs:3145 blocked for more than 123 seconds.
  [373.114147]       Not tainted 5.16.0-rc8 torvalds#7
  [373.120401] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [373.130393] task:btrfs           state:D stack:    0 pid: 3145 ppid:  3141 flags:0x00004000
  [373.140998] Call Trace:
  [373.145501]  <TASK>
  [373.149654]  __schedule+0xb56/0x4850
  [373.155306]  ? io_schedule_timeout+0x190/0x190
  [373.161965]  ? lockdep_hardirqs_on+0x7e/0x100
  [373.168469]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
  [373.175468]  schedule+0xe0/0x270
  [373.180814]  wait_for_commit+0x104/0x150 [btrfs]
  [373.187643]  ? test_and_set_bit+0x20/0x20 [btrfs]
  [373.194772]  ? kmem_cache_free+0x124/0x550
  [373.201191]  ? btrfs_put_transaction+0x69/0x3d0 [btrfs]
  [373.208738]  ? finish_wait+0x270/0x270
  [373.214704]  ? __btrfs_end_transaction+0x347/0x7b0 [btrfs]
  [373.222342]  btrfs_commit_transaction+0x44d/0x2610 [btrfs]
  [373.230233]  ? join_transaction+0x255/0xe30 [btrfs]
  [373.237334]  ? btrfs_record_root_in_trans+0x4d/0x170 [btrfs]
  [373.245251]  ? btrfs_apply_pending_changes+0x50/0x50 [btrfs]
  [373.253296]  relocate_block_group+0x105/0xc20 [btrfs]
  [373.260533]  ? mutex_lock_io_nested+0x1270/0x1270
  [373.267516]  ? btrfs_wait_nocow_writers+0x85/0x180 [btrfs]
  [373.275155]  ? merge_reloc_roots+0x710/0x710 [btrfs]
  [373.283602]  ? btrfs_wait_ordered_extents+0xd30/0xd30 [btrfs]
  [373.291934]  ? kmem_cache_free+0x124/0x550
  [373.298180]  btrfs_relocate_block_group+0x35c/0x930 [btrfs]
  [373.306047]  btrfs_relocate_chunk+0x85/0x210 [btrfs]
  [373.313229]  btrfs_balance+0x12f4/0x2d20 [btrfs]
  [373.320227]  ? lock_release+0x3a9/0x6d0
  [373.326206]  ? btrfs_relocate_chunk+0x210/0x210 [btrfs]
  [373.333591]  ? lock_is_held_type+0xe4/0x140
  [373.340031]  ? rcu_read_lock_sched_held+0x3f/0x70
  [373.346910]  btrfs_ioctl_balance+0x548/0x700 [btrfs]
  [373.354207]  btrfs_ioctl+0x7f2/0x71b0 [btrfs]
  [373.360774]  ? lockdep_hardirqs_on_prepare+0x410/0x410
  [373.367957]  ? lockdep_hardirqs_on_prepare+0x410/0x410
  [373.375327]  ? btrfs_ioctl_get_supported_features+0x20/0x20 [btrfs]
  [373.383841]  ? find_held_lock+0x2c/0x110
  [373.389993]  ? lock_release+0x3a9/0x6d0
  [373.395828]  ? mntput_no_expire+0xf7/0xad0
  [373.402083]  ? lock_is_held_type+0xe4/0x140
  [373.408249]  ? vfs_fileattr_set+0x9f0/0x9f0
  [373.414486]  ? selinux_file_ioctl+0x349/0x4e0
  [373.420938]  ? trace_raw_output_lock+0xb4/0xe0
  [373.427442]  ? selinux_inode_getsecctx+0x80/0x80
  [373.434224]  ? lockdep_hardirqs_on+0x7e/0x100
  [373.440660]  ? force_qs_rnp+0x2a0/0x6b0
  [373.446534]  ? lock_is_held_type+0x9b/0x140
  [373.452763]  ? __blkcg_punt_bio_submit+0x1b0/0x1b0
  [373.459732]  ? security_file_ioctl+0x50/0x90
  [373.466089]  __x64_sys_ioctl+0x127/0x190
  [373.472022]  do_syscall_64+0x3b/0x90
  [373.477513]  entry_SYSCALL_64_after_hwframe+0x44/0xae
  [373.484823] RIP: 0033:0x7f8f4af7e2bb
  [373.490493] RSP: 002b:00007ffcbf936178 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [373.500197] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f8f4af7e2bb
  [373.509451] RDX: 00007ffcbf936220 RSI: 00000000c4009420 RDI: 0000000000000003
  [373.518659] RBP: 00007ffcbf93774a R08: 0000000000000013 R09: 00007f8f4b02d4e0
  [373.527872] R10: 00007f8f4ae87740 R11: 0000000000000246 R12: 0000000000000001
  [373.537222] R13: 00007ffcbf936220 R14: 0000000000000000 R15: 0000000000000002
  [373.546506]  </TASK>
  [373.550878] INFO: task btrfs:3146 blocked for more than 123 seconds.
  [373.559383]       Not tainted 5.16.0-rc8 torvalds#7
  [373.565748] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [373.575748] task:btrfs           state:D stack:    0 pid: 3146 ppid:  2168 flags:0x00000000
  [373.586314] Call Trace:
  [373.590846]  <TASK>
  [373.595121]  __schedule+0xb56/0x4850
  [373.600901]  ? __lock_acquire+0x23db/0x5030
  [373.607176]  ? io_schedule_timeout+0x190/0x190
  [373.613954]  schedule+0xe0/0x270
  [373.619157]  schedule_timeout+0x168/0x220
  [373.625170]  ? usleep_range_state+0x150/0x150
  [373.631653]  ? mark_held_locks+0x9e/0xe0
  [373.637767]  ? do_raw_spin_lock+0x11e/0x250
  [373.643993]  ? lockdep_hardirqs_on_prepare+0x17b/0x410
  [373.651267]  ? _raw_spin_unlock_irq+0x24/0x50
  [373.657677]  ? lockdep_hardirqs_on+0x7e/0x100
  [373.664103]  wait_for_completion+0x163/0x250
  [373.670437]  ? bit_wait_timeout+0x160/0x160
  [373.676585]  btrfs_quota_disable+0x176/0x9a0 [btrfs]
  [373.683979]  ? btrfs_quota_enable+0x12f0/0x12f0 [btrfs]
  [373.691340]  ? down_write+0xd0/0x130
  [373.696880]  ? down_write_killable+0x150/0x150
  [373.703352]  btrfs_ioctl+0x3945/0x71b0 [btrfs]
  [373.710061]  ? find_held_lock+0x2c/0x110
  [373.716192]  ? lock_release+0x3a9/0x6d0
  [373.722047]  ? __handle_mm_fault+0x23cd/0x3050
  [373.728486]  ? btrfs_ioctl_get_supported_features+0x20/0x20 [btrfs]
  [373.737032]  ? set_pte+0x6a/0x90
  [373.742271]  ? do_raw_spin_unlock+0x55/0x1f0
  [373.748506]  ? lock_is_held_type+0xe4/0x140
  [373.754792]  ? vfs_fileattr_set+0x9f0/0x9f0
  [373.761083]  ? selinux_file_ioctl+0x349/0x4e0
  [373.767521]  ? selinux_inode_getsecctx+0x80/0x80
  [373.774247]  ? __up_read+0x182/0x6e0
  [373.780026]  ? count_memcg_events.constprop.0+0x46/0x60
  [373.787281]  ? up_write+0x460/0x460
  [373.792932]  ? security_file_ioctl+0x50/0x90
  [373.799232]  __x64_sys_ioctl+0x127/0x190
  [373.805237]  do_syscall_64+0x3b/0x90
  [373.810947]  entry_SYSCALL_64_after_hwframe+0x44/0xae
  [373.818102] RIP: 0033:0x7f1383ea02bb
  [373.823847] RSP: 002b:00007fffeb4d71f8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
  [373.833641] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1383ea02bb
  [373.842961] RDX: 00007fffeb4d7210 RSI: 00000000c0109428 RDI: 0000000000000003
  [373.852179] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
  [373.861408] R10: 00007f1383daec78 R11: 0000000000000202 R12: 00007fffeb4d874a
  [373.870647] R13: 0000000000493099 R14: 0000000000000001 R15: 0000000000000000
  [373.879838]  </TASK>
  [373.884018]
               Showing all locks held in the system:
  [373.894250] 3 locks held by kworker/4:1/58:
  [373.900356] 1 lock held by khungtaskd/63:
  [373.906333]  #0: ffffffff8945ff60 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x53/0x260
  [373.917307] 3 locks held by kworker/u16:6/103:
  [373.923938]  #0: ffff888127b4f138 ((wq_completion)btrfs-qgroup-rescan){+.+.}-{0:0}, at: process_one_work+0x712/0x1320
  [373.936555]  #1: ffff88810b817dd8 ((work_completion)(&work->normal_work)){+.+.}-{0:0}, at: process_one_work+0x73f/0x1320
  [373.951109]  #2: ffff888102dd4650 (sb_internal#2){.+.+}-{0:0}, at: btrfs_qgroup_rescan_worker+0x1f6/0x10c0 [btrfs]
  [373.964027] 2 locks held by less/1803:
  [373.969982]  #0: ffff88813ed56098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x24/0x80
  [373.981295]  #1: ffffc90000b3b2e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x9e2/0x1060
  [373.992969] 1 lock held by btrfs-transacti/2347:
  [373.999893]  #0: ffff88813d4887a8 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0xe3/0x3c0 [btrfs]
  [374.015872] 3 locks held by btrfs/3145:
  [374.022298]  #0: ffff888102dd4460 (sb_writers#18){.+.+}-{0:0}, at: btrfs_ioctl_balance+0xc3/0x700 [btrfs]
  [374.034456]  #1: ffff88813d48a0a0 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0xfe5/0x2d20 [btrfs]
  [374.047646]  #2: ffff88813d488838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x354/0x930 [btrfs]
  [374.063295] 4 locks held by btrfs/3146:
  [374.069647]  #0: ffff888102dd4460 (sb_writers#18){.+.+}-{0:0}, at: btrfs_ioctl+0x38b1/0x71b0 [btrfs]
  [374.081601]  #1: ffff88813d488bb8 (&fs_info->subvol_sem){+.+.}-{3:3}, at: btrfs_ioctl+0x38fd/0x71b0 [btrfs]
  [374.094283]  #2: ffff888102dd4650 (sb_internal#2){.+.+}-{0:0}, at: btrfs_quota_disable+0xc8/0x9a0 [btrfs]
  [374.106885]  #3: ffff88813d489800 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_disable+0xd5/0x9a0 [btrfs]

  [374.126780] =============================================

To avoid the deadlock, wait for the qgroup rescan worker to complete
before starting the transaction for the quota disable ioctl. Clear
BTRFS_FS_QUOTA_ENABLE flag before the wait and the transaction to
request the worker to complete. On transaction start failure, set the
BTRFS_FS_QUOTA_ENABLE flag again. These BTRFS_FS_QUOTA_ENABLE flag
changes can be done safely since the function btrfs_quota_disable is not
called concurrently because of fs_info->subvol_sem.

Also check the BTRFS_FS_QUOTA_ENABLE flag in qgroup_rescan_init to avoid
another qgroup rescan worker to start after the previous qgroup worker
completed.

CC: stable@vger.kernel.org # 5.4+
Suggested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
codomania pushed a commit that referenced this pull request Feb 9, 2022
…/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 5.17, take #2

- A couple of fixes when handling an exception while a SError has been
  delivered

- Workaround for Cortex-A510's single-step[ erratum
codomania pushed a commit that referenced this pull request Feb 24, 2022
…egulator

The interrupt pin of the external ethernet phy is used, instead of the
enable-gpio pin of the tf-io regulator. The GPIOE_2 pin is located in
the gpio_ao bank.

This causes phy interrupt problems at system startup.
[   76.645190] irq 36: nobody cared (try booting with the "irqpoll" option)
[   76.649617] CPU: 0 PID: 1416 Comm: irq/36-0.0:00 Not tainted 5.16.0 #2
[   76.649629] Hardware name: Hardkernel ODROID-HC4 (DT)
[   76.649635] Call trace:
[   76.649638]  dump_backtrace+0x0/0x1c8
[   76.649658]  show_stack+0x14/0x60
[   76.649667]  dump_stack_lvl+0x64/0x7c
[   76.649676]  dump_stack+0x14/0x2c
[   76.649683]  __report_bad_irq+0x38/0xe8
[   76.649695]  note_interrupt+0x220/0x3a0
[   76.649704]  handle_irq_event_percpu+0x58/0x88
[   76.649713]  handle_irq_event+0x44/0xd8
[   76.649721]  handle_fasteoi_irq+0xa8/0x130
[   76.649730]  generic_handle_domain_irq+0x38/0x58
[   76.649738]  gic_handle_irq+0x9c/0xb8
[   76.649747]  call_on_irq_stack+0x28/0x38
[   76.649755]  do_interrupt_handler+0x7c/0x80
[   76.649763]  el1_interrupt+0x34/0x80
[   76.649772]  el1h_64_irq_handler+0x14/0x20
[   76.649781]  el1h_64_irq+0x74/0x78
[   76.649788]  irq_finalize_oneshot.part.56+0x68/0xf8
[   76.649796]  irq_thread_fn+0x5c/0x98
[   76.649804]  irq_thread+0x13c/0x260
[   76.649812]  kthread+0x144/0x178
[   76.649822]  ret_from_fork+0x10/0x20
[   76.649830] handlers:
[   76.653170] [<0000000025a6cd31>] irq_default_primary_handler threaded [<0000000093580eb7>] phy_interrupt
[   76.661256] Disabling IRQ torvalds#36

Fixes: 1f80a5c ("arm64: dts: meson-sm1-odroid: add missing enable gpio and supply for tf_io regulator")
Signed-off-by: Lutz Koschorreck <theleks@ko-hh.de>
Reviewed-by: Neil Armstrong <narmstrong@baylibre.com>
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
[narmstrong: removed spurious invalid & blank lines from commit message]
Link: https://lore.kernel.org/r/20220127130537.GA187347@odroid-VirtualBox
codomania pushed a commit that referenced this pull request Feb 24, 2022
When using the flushoncommit mount option, during almost every transaction
commit we trigger a warning from __writeback_inodes_sb_nr():

  $ cat fs/fs-writeback.c:
  (...)
  static void __writeback_inodes_sb_nr(struct super_block *sb, ...
  {
        (...)
        WARN_ON(!rwsem_is_locked(&sb->s_umount));
        (...)
  }
  (...)

The trace produced in dmesg looks like the following:

  [947.473890] WARNING: CPU: 5 PID: 930 at fs/fs-writeback.c:2610 __writeback_inodes_sb_nr+0x7e/0xb3
  [947.481623] Modules linked in: nfsd nls_cp437 cifs asn1_decoder cifs_arc4 fscache cifs_md4 ipmi_ssif
  [947.489571] CPU: 5 PID: 930 Comm: btrfs-transacti Not tainted 95.16.3-srb-asrock-00001-g36437ad63879 torvalds#186
  [947.497969] RIP: 0010:__writeback_inodes_sb_nr+0x7e/0xb3
  [947.502097] Code: 24 10 4c 89 44 24 18 c6 (...)
  [947.519760] RSP: 0018:ffffc90000777e10 EFLAGS: 00010246
  [947.523818] RAX: 0000000000000000 RBX: 0000000000963300 RCX: 0000000000000000
  [947.529765] RDX: 0000000000000000 RSI: 000000000000fa51 RDI: ffffc90000777e50
  [947.535740] RBP: ffff888101628a90 R08: ffff888100955800 R09: ffff888100956000
  [947.541701] R10: 0000000000000002 R11: 0000000000000001 R12: ffff888100963488
  [947.547645] R13: ffff888100963000 R14: ffff888112fb7200 R15: ffff888100963460
  [947.553621] FS:  0000000000000000(0000) GS:ffff88841fd40000(0000) knlGS:0000000000000000
  [947.560537] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [947.565122] CR2: 0000000008be50c4 CR3: 000000000220c000 CR4: 00000000001006e0
  [947.571072] Call Trace:
  [947.572354]  <TASK>
  [947.573266]  btrfs_commit_transaction+0x1f1/0x998
  [947.576785]  ? start_transaction+0x3ab/0x44e
  [947.579867]  ? schedule_timeout+0x8a/0xdd
  [947.582716]  transaction_kthread+0xe9/0x156
  [947.585721]  ? btrfs_cleanup_transaction.isra.0+0x407/0x407
  [947.590104]  kthread+0x131/0x139
  [947.592168]  ? set_kthread_struct+0x32/0x32
  [947.595174]  ret_from_fork+0x22/0x30
  [947.597561]  </TASK>
  [947.598553] ---[ end trace 644721052755541c ]---

This is because we started using writeback_inodes_sb() to flush delalloc
when committing a transaction (when using -o flushoncommit), in order to
avoid deadlocks with filesystem freeze operations. This change was made
by commit ce8ea7c ("btrfs: don't call btrfs_start_delalloc_roots
in flushoncommit"). After that change we started producing that warning,
and every now and then a user reports this since the warning happens too
often, it spams dmesg/syslog, and a user is unsure if this reflects any
problem that might compromise the filesystem's reliability.

We can not just lock the sb->s_umount semaphore before calling
writeback_inodes_sb(), because that would at least deadlock with
filesystem freezing, since at fs/super.c:freeze_super() sync_filesystem()
is called while we are holding that semaphore in write mode, and that can
trigger a transaction commit, resulting in a deadlock. It would also
trigger the same type of deadlock in the unmount path. Possibly, it could
also introduce some other locking dependencies that lockdep would report.

To fix this call try_to_writeback_inodes_sb() instead of
writeback_inodes_sb(), because that will try to read lock sb->s_umount
and then will only call writeback_inodes_sb() if it was able to lock it.
This is fine because the cases where it can't read lock sb->s_umount
are during a filesystem unmount or during a filesystem freeze - in those
cases sb->s_umount is write locked and sync_filesystem() is called, which
calls writeback_inodes_sb(). In other words, in all cases where we can't
take a read lock on sb->s_umount, writeback is already being triggered
elsewhere.

An alternative would be to call btrfs_start_delalloc_roots() with a
number of pages different from LONG_MAX, for example matching the number
of delalloc bytes we currently have, in which case we would end up
starting all delalloc with filemap_fdatawrite_wbc() and not with an
async flush via filemap_flush() - that is only possible after the rather
recent commit e076ab2 ("btrfs: shrink delalloc pages instead of
full inodes"). However that creates a whole new can of worms due to new
lock dependencies, which lockdep complains, like for example:

[ 8948.247280] ======================================================
[ 8948.247823] WARNING: possible circular locking dependency detected
[ 8948.248353] 5.17.0-rc1-btrfs-next-111 #1 Not tainted
[ 8948.248786] ------------------------------------------------------
[ 8948.249320] kworker/u16:18/933570 is trying to acquire lock:
[ 8948.249812] ffff9b3de1591690 (sb_internal#2){.+.+}-{0:0}, at: find_free_extent+0x141e/0x1590 [btrfs]
[ 8948.250638]
               but task is already holding lock:
[ 8948.251140] ffff9b3e09c717d8 (&root->delalloc_mutex){+.+.}-{3:3}, at: start_delalloc_inodes+0x78/0x400 [btrfs]
[ 8948.252018]
               which lock already depends on the new lock.

[ 8948.252710]
               the existing dependency chain (in reverse order) is:
[ 8948.253343]
               -> #2 (&root->delalloc_mutex){+.+.}-{3:3}:
[ 8948.253950]        __mutex_lock+0x90/0x900
[ 8948.254354]        start_delalloc_inodes+0x78/0x400 [btrfs]
[ 8948.254859]        btrfs_start_delalloc_roots+0x194/0x2a0 [btrfs]
[ 8948.255408]        btrfs_commit_transaction+0x32f/0xc00 [btrfs]
[ 8948.255942]        btrfs_mksubvol+0x380/0x570 [btrfs]
[ 8948.256406]        btrfs_mksnapshot+0x81/0xb0 [btrfs]
[ 8948.256870]        __btrfs_ioctl_snap_create+0x17f/0x190 [btrfs]
[ 8948.257413]        btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs]
[ 8948.257961]        btrfs_ioctl+0x1196/0x3630 [btrfs]
[ 8948.258418]        __x64_sys_ioctl+0x83/0xb0
[ 8948.258793]        do_syscall_64+0x3b/0xc0
[ 8948.259146]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 8948.259709]
               -> #1 (&fs_info->delalloc_root_mutex){+.+.}-{3:3}:
[ 8948.260330]        __mutex_lock+0x90/0x900
[ 8948.260692]        btrfs_start_delalloc_roots+0x97/0x2a0 [btrfs]
[ 8948.261234]        btrfs_commit_transaction+0x32f/0xc00 [btrfs]
[ 8948.261766]        btrfs_set_free_space_cache_v1_active+0x38/0x60 [btrfs]
[ 8948.262379]        btrfs_start_pre_rw_mount+0x119/0x180 [btrfs]
[ 8948.262909]        open_ctree+0x1511/0x171e [btrfs]
[ 8948.263359]        btrfs_mount_root.cold+0x12/0xde [btrfs]
[ 8948.263863]        legacy_get_tree+0x30/0x50
[ 8948.264242]        vfs_get_tree+0x28/0xc0
[ 8948.264594]        vfs_kern_mount.part.0+0x71/0xb0
[ 8948.265017]        btrfs_mount+0x11d/0x3a0 [btrfs]
[ 8948.265462]        legacy_get_tree+0x30/0x50
[ 8948.265851]        vfs_get_tree+0x28/0xc0
[ 8948.266203]        path_mount+0x2d4/0xbe0
[ 8948.266554]        __x64_sys_mount+0x103/0x140
[ 8948.266940]        do_syscall_64+0x3b/0xc0
[ 8948.267300]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 8948.267790]
               -> #0 (sb_internal#2){.+.+}-{0:0}:
[ 8948.268322]        __lock_acquire+0x12e8/0x2260
[ 8948.268733]        lock_acquire+0xd7/0x310
[ 8948.269092]        start_transaction+0x44c/0x6e0 [btrfs]
[ 8948.269591]        find_free_extent+0x141e/0x1590 [btrfs]
[ 8948.270087]        btrfs_reserve_extent+0x14b/0x280 [btrfs]
[ 8948.270588]        cow_file_range+0x17e/0x490 [btrfs]
[ 8948.271051]        btrfs_run_delalloc_range+0x345/0x7a0 [btrfs]
[ 8948.271586]        writepage_delalloc+0xb5/0x170 [btrfs]
[ 8948.272071]        __extent_writepage+0x156/0x3c0 [btrfs]
[ 8948.272579]        extent_write_cache_pages+0x263/0x460 [btrfs]
[ 8948.273113]        extent_writepages+0x76/0x130 [btrfs]
[ 8948.273573]        do_writepages+0xd2/0x1c0
[ 8948.273942]        filemap_fdatawrite_wbc+0x68/0x90
[ 8948.274371]        start_delalloc_inodes+0x17f/0x400 [btrfs]
[ 8948.274876]        btrfs_start_delalloc_roots+0x194/0x2a0 [btrfs]
[ 8948.275417]        flush_space+0x1f2/0x630 [btrfs]
[ 8948.275863]        btrfs_async_reclaim_data_space+0x108/0x1b0 [btrfs]
[ 8948.276438]        process_one_work+0x252/0x5a0
[ 8948.276829]        worker_thread+0x55/0x3b0
[ 8948.277189]        kthread+0xf2/0x120
[ 8948.277506]        ret_from_fork+0x22/0x30
[ 8948.277868]
               other info that might help us debug this:

[ 8948.278548] Chain exists of:
                 sb_internal#2 --> &fs_info->delalloc_root_mutex --> &root->delalloc_mutex

[ 8948.279601]  Possible unsafe locking scenario:

[ 8948.280102]        CPU0                    CPU1
[ 8948.280508]        ----                    ----
[ 8948.280915]   lock(&root->delalloc_mutex);
[ 8948.281271]                                lock(&fs_info->delalloc_root_mutex);
[ 8948.281915]                                lock(&root->delalloc_mutex);
[ 8948.282487]   lock(sb_internal#2);
[ 8948.282800]
                *** DEADLOCK ***

[ 8948.283333] 4 locks held by kworker/u16:18/933570:
[ 8948.283750]  #0: ffff9b3dc00a9d48 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x1d2/0x5a0
[ 8948.284609]  #1: ffffa90349dafe70 ((work_completion)(&fs_info->async_data_reclaim_work)){+.+.}-{0:0}, at: process_one_work+0x1d2/0x5a0
[ 8948.285637]  #2: ffff9b3e14db5040 (&fs_info->delalloc_root_mutex){+.+.}-{3:3}, at: btrfs_start_delalloc_roots+0x97/0x2a0 [btrfs]
[ 8948.286674]  #3: ffff9b3e09c717d8 (&root->delalloc_mutex){+.+.}-{3:3}, at: start_delalloc_inodes+0x78/0x400 [btrfs]
[ 8948.287596]
              stack backtrace:
[ 8948.287975] CPU: 3 PID: 933570 Comm: kworker/u16:18 Not tainted 5.17.0-rc1-btrfs-next-111 #1
[ 8948.288677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 8948.289649] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
[ 8948.290298] Call Trace:
[ 8948.290517]  <TASK>
[ 8948.290700]  dump_stack_lvl+0x59/0x73
[ 8948.291026]  check_noncircular+0xf3/0x110
[ 8948.291375]  ? start_transaction+0x228/0x6e0 [btrfs]
[ 8948.291826]  __lock_acquire+0x12e8/0x2260
[ 8948.292241]  lock_acquire+0xd7/0x310
[ 8948.292714]  ? find_free_extent+0x141e/0x1590 [btrfs]
[ 8948.293241]  ? lock_is_held_type+0xea/0x140
[ 8948.293601]  start_transaction+0x44c/0x6e0 [btrfs]
[ 8948.294055]  ? find_free_extent+0x141e/0x1590 [btrfs]
[ 8948.294518]  find_free_extent+0x141e/0x1590 [btrfs]
[ 8948.294957]  ? _raw_spin_unlock+0x29/0x40
[ 8948.295312]  ? btrfs_get_alloc_profile+0x124/0x290 [btrfs]
[ 8948.295813]  btrfs_reserve_extent+0x14b/0x280 [btrfs]
[ 8948.296270]  cow_file_range+0x17e/0x490 [btrfs]
[ 8948.296691]  btrfs_run_delalloc_range+0x345/0x7a0 [btrfs]
[ 8948.297175]  ? find_lock_delalloc_range+0x247/0x270 [btrfs]
[ 8948.297678]  writepage_delalloc+0xb5/0x170 [btrfs]
[ 8948.298123]  __extent_writepage+0x156/0x3c0 [btrfs]
[ 8948.298570]  extent_write_cache_pages+0x263/0x460 [btrfs]
[ 8948.299061]  extent_writepages+0x76/0x130 [btrfs]
[ 8948.299495]  do_writepages+0xd2/0x1c0
[ 8948.299817]  ? sched_clock_cpu+0xd/0x110
[ 8948.300160]  ? lock_release+0x155/0x4a0
[ 8948.300494]  filemap_fdatawrite_wbc+0x68/0x90
[ 8948.300874]  ? do_raw_spin_unlock+0x4b/0xa0
[ 8948.301243]  start_delalloc_inodes+0x17f/0x400 [btrfs]
[ 8948.301706]  ? lock_release+0x155/0x4a0
[ 8948.302055]  btrfs_start_delalloc_roots+0x194/0x2a0 [btrfs]
[ 8948.302564]  flush_space+0x1f2/0x630 [btrfs]
[ 8948.302970]  btrfs_async_reclaim_data_space+0x108/0x1b0 [btrfs]
[ 8948.303510]  process_one_work+0x252/0x5a0
[ 8948.303860]  ? process_one_work+0x5a0/0x5a0
[ 8948.304221]  worker_thread+0x55/0x3b0
[ 8948.304543]  ? process_one_work+0x5a0/0x5a0
[ 8948.304904]  kthread+0xf2/0x120
[ 8948.305184]  ? kthread_complete_and_exit+0x20/0x20
[ 8948.305598]  ret_from_fork+0x22/0x30
[ 8948.305921]  </TASK>

It all comes from the fact that btrfs_start_delalloc_roots() takes the
delalloc_root_mutex, in the transaction commit path we are holding a
read lock on one of the superblock's freeze semaphores (via
sb_start_intwrite()), the async reclaim task can also do a call to
btrfs_start_delalloc_roots(), which ends up triggering writeback with
calls to filemap_fdatawrite_wbc(), resulting in extent allocation which
in turn can call btrfs_start_transaction(), which will result in taking
the freeze semaphore via sb_start_intwrite(), forming a nasty dependency
on all those locks which can be taken in different orders by different
code paths.

So just adopt the simple approach of calling try_to_writeback_inodes_sb()
at btrfs_start_delalloc_flush().

Link: https://lore.kernel.org/linux-btrfs/20220130005258.GA7465@cuci.nl/
Link: https://lore.kernel.org/linux-btrfs/43acc426-d683-d1b6-729d-c6bc4a2fff4d@gmail.com/
Link: https://lore.kernel.org/linux-btrfs/6833930a-08d7-6fbc-0141-eb9cdfd6bb4d@gmail.com/
Link: https://lore.kernel.org/linux-btrfs/20190322041731.GF16651@hungrycats.org/
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[ add more link reports ]
Signed-off-by: David Sterba <dsterba@suse.com>
codomania pushed a commit that referenced this pull request Mar 7, 2022
Yonghong Song says:

====================

The patch [1] exposed a bpf_timer initialization bug in function
check_and_init_map_value(). With bug fix here, the patch [1]
can be applied with all selftests passed. Please see individual
patches for fix details.

  [1] https://lore.kernel.org/bpf/20220209070324.1093182-2-memxor@gmail.com/

Changelog:
  v3 -> v4:
    . move header file in patch #1 to avoid bpf-next merge conflict
  v2 -> v3:
    . switch patch #1 and patch #2 for better bisecting
  v1 -> v2:
    . add Fixes tag for patch #1
    . rebase against bpf tree
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
deeglaze added a commit to deeglaze/amdese-linux that referenced this pull request Jun 7, 2022
UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual Machine
platform.

There are several ways kernel can deal with unaccepted memory:

 1. Accept all the memory during the boot. It is easy to implement and
    it doesn't have runtime cost once the system is booted. The downside
    is very long boot time.

    Accept can be parallelized to multiple CPUs to keep it manageable
    (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
    memory bandwidth and does not scale beyond the point.

 2. Accept a block of memory on the first use. It requires more
    infrastructure and changes in page allocator to make it work, but
    it provides good boot time.

    On-demand memory accept means latency spikes every time kernel steps
    onto a new memory block. The spikes will go away once workload data
    set size gets stabilized or all memory gets accepted.

 3. Accept all memory in background. Introduce a thread (or multiple)
    that gets memory accepted proactively. It will minimize time the
    system experience latency spikes on memory allocation while keeping
    low boot time.

    This approach cannot function on its own. It is an extension of AMDESE#2:
    background memory acceptance requires functional scheduler, but the
    page allocator may need to tap into unaccepted memory before that.

    The downside of the approach is that these threads also steal CPU
    cycles and memory bandwidth from the user's workload and may hurt
    user experience.

Implement AMDESE#2 for now. It is a reasonable default. Some workloads may
want to use AMDESE#1 or AMDESE#3 and they can be implemented later based on user's
demands.

Support of unaccepted memory requires a few changes in core-mm code:

  - memblock has to accept memory on allocation;

  - page allocator has to accept memory on the first allocation of the
    page;

Memblock change is trivial.

The page allocator is modified to accept pages on the first allocation.
The new page type (encoded in the _mapcount) -- PageUnaccepted() -- is
used to indicate that the page requires acceptance.

Architecture has to provide two helpers if it wants to support
unaccepted memory:

 - accept_memory() makes a range of physical addresses accepted.

 - memory_is_unaccepted() checks anything within the range of physical
   addresses requires acceptance.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
ashkalra pushed a commit that referenced this pull request Jun 20, 2022
The behavior of the SEV-legacy commands is altered when the SNP firmware
is in the INIT state. When SNP is in INIT state, all the SEV-legacy
commands that cause the firmware to write to memory must be in the
firmware state before issuing the command..

A command buffer may contains a system physical address that the firmware
may write to. There are two cases that need to be handled:

1) system physical address points to a guest memory
2) system physical address points to a host memory

To handle the case #1, change the page state to the firmware in the RMP
table before issuing the command and restore the state to shared after the
command completes.

For the case #2, use a bounce buffer to complete the request.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
ashkalra pushed a commit that referenced this pull request Jul 14, 2022
The behavior of the SEV-legacy commands is altered when the SNP firmware
is in the INIT state. When SNP is in INIT state, all the SEV-legacy
commands that cause the firmware to write to memory must be in the
firmware state before issuing the command..

A command buffer may contains a system physical address that the firmware
may write to. There are two cases that need to be handled:

1) system physical address points to a guest memory
2) system physical address points to a host memory

To handle the case #1, change the page state to the firmware in the RMP
table before issuing the command and restore the state to shared after the
command completes.

For the case #2, use a bounce buffer to complete the request.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
yghannam pushed a commit that referenced this pull request Jul 28, 2022
This was missed in c3ed222 ("NFSv4: Fix free of uninitialized
nfs4_label on referral lookup.") and causes a panic when mounting
with '-o trunkdiscovery':

PID: 1604   TASK: ffff93dac3520000  CPU: 3   COMMAND: "mount.nfs"
 #0 [ffffb79140f738f8] machine_kexec at ffffffffaec64bee
 #1 [ffffb79140f73950] __crash_kexec at ffffffffaeda67fd
 #2 [ffffb79140f73a18] crash_kexec at ffffffffaeda76ed
 #3 [ffffb79140f73a30] oops_end at ffffffffaec2658d
 #4 [ffffb79140f73a50] general_protection at ffffffffaf60111e
    [exception RIP: nfs_fattr_init+0x5]
    RIP: ffffffffc0c18265  RSP: ffffb79140f73b08  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff93dac304a800  RCX: 0000000000000000
    RDX: ffffb79140f73bb0  RSI: ffff93dadc8cbb40  RDI: d03ee11cfaf6bd50
    RBP: ffffb79140f73be8   R8: ffffffffc0691560   R9: 0000000000000006
    R10: ffff93db3ffd3df8  R11: 0000000000000000  R12: ffff93dac4040000
    R13: ffff93dac2848e00  R14: ffffb79140f73b60  R15: ffffb79140f73b30
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffffb79140f73b08] _nfs41_proc_get_locations at ffffffffc0c73d53 [nfsv4]
 #6 [ffffb79140f73bf0] nfs4_proc_get_locations at ffffffffc0c83e90 [nfsv4]
 torvalds#7 [ffffb79140f73c60] nfs4_discover_trunking at ffffffffc0c83fb7 [nfsv4]
 torvalds#8 [ffffb79140f73cd8] nfs_probe_fsinfo at ffffffffc0c0f95f [nfs]
 torvalds#9 [ffffb79140f73da0] nfs_probe_server at ffffffffc0c1026a [nfs]
    RIP: 00007f6254fce26e  RSP: 00007ffc69496ac8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000000000  RCX: 00007f6254fce26e
    RDX: 00005600220a82a0  RSI: 00005600220a64d0  RDI: 00005600220a6520
    RBP: 00007ffc69496c50   R8: 00005600220a8710   R9: 003035322e323231
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007ffc69496c50
    R13: 00005600220a8440  R14: 0000000000000010  R15: 0000560020650ef9
    ORIG_RAX: 00000000000000a5  CS: 0033  SS: 002b

Fixes: c3ed222 ("NFSv4: Fix free of uninitialized nfs4_label on referral lookup.")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
yghannam pushed a commit that referenced this pull request Jul 28, 2022
…nline extents

When doing a direct IO read or write, we always return -ENOTBLK when we
find a compressed extent (or an inline extent) so that we fallback to
buffered IO. This however is not ideal in case we are in a NOWAIT context
(io_uring for example), because buffered IO can block and we currently
have no support for NOWAIT semantics for buffered IO, so if we need to
fallback to buffered IO we should first signal the caller that we may
need to block by returning -EAGAIN instead.

This behaviour can also result in short reads being returned to user
space, which although it's not incorrect and user space should be able
to deal with partial reads, it's somewhat surprising and even some popular
applications like QEMU (Link tag #1) and MariaDB (Link tag #2) don't
deal with short reads properly (or at all).

The short read case happens when we try to read from a range that has a
non-compressed and non-inline extent followed by a compressed extent.
After having read the first extent, when we find the compressed extent we
return -ENOTBLK from btrfs_dio_iomap_begin(), which results in iomap to
treat the request as a short read, returning 0 (success) and waiting for
previously submitted bios to complete (this happens at
fs/iomap/direct-io.c:__iomap_dio_rw()). After that, and while at
btrfs_file_read_iter(), we call filemap_read() to use buffered IO to
read the remaining data, and pass it the number of bytes we were able to
read with direct IO. Than at filemap_read() if we get a page fault error
when accessing the read buffer, we return a partial read instead of an
-EFAULT error, because the number of bytes previously read is greater
than zero.

So fix this by returning -EAGAIN for NOWAIT direct IO when we find a
compressed or an inline extent.

Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
Link: https://lore.kernel.org/linux-btrfs/YrrFGO4A1jS0GI0G@atmark-techno.com/
Link: https://jira.mariadb.org/browse/MDEV-27900?focusedCommentId=216582&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-216582
Tested-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
yghannam pushed a commit that referenced this pull request Jul 28, 2022
…ernel/git/at91/linux into arm/fixes

AT91 fixes for 5.19 #2

It contains 2 DT fixes:
- one for SAMA5D2 to fix the i2s1 assigned-clock-parents property
- one for kswitch-d10 (LAN966 based) enforcing proper settings
  on GPIO pins

* tag 'at91-fixes-5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/at91/linux:
  ARM: dts: at91: sama5d2: Fix typo in i2s1 node
  ARM: dts: kswitch-d10: use open drain mode for coma-mode pins

Link: https://lore.kernel.org/r/20220708151621.860339-1-claudiu.beznea@microchip.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
yghannam pushed a commit that referenced this pull request Jul 28, 2022
 into HEAD

 KVM/riscv fixes for 5.19, take #2

- Fix missing PAGE_PFN_MASK

- Fix SRCU deadlock caused by kvm_riscv_check_vcpu_requests()
yghannam pushed a commit that referenced this pull request Jul 28, 2022
On powerpc, 'perf trace' is crashing with a SIGSEGV when trying to
process a perf.data file created with 'perf trace record -p':

  #0  0x00000001225b8988 in syscall_arg__scnprintf_augmented_string <snip> at builtin-trace.c:1492
  #1  syscall_arg__scnprintf_filename <snip> at builtin-trace.c:1492
  #2  syscall_arg__scnprintf_filename <snip> at builtin-trace.c:1486
  #3  0x00000001225bdd9c in syscall_arg_fmt__scnprintf_val <snip> at builtin-trace.c:1973
  #4  syscall__scnprintf_args <snip> at builtin-trace.c:2041
  #5  0x00000001225bff04 in trace__sys_enter <snip> at builtin-trace.c:2319

That points to the below code in tools/perf/builtin-trace.c:
	/*
	 * If this is raw_syscalls.sys_enter, then it always comes with the 6 possible
	 * arguments, even if the syscall being handled, say "openat", uses only 4 arguments
	 * this breaks syscall__augmented_args() check for augmented args, as we calculate
	 * syscall->args_size using each syscalls:sys_enter_NAME tracefs format file,
	 * so when handling, say the openat syscall, we end up getting 6 args for the
	 * raw_syscalls:sys_enter event, when we expected just 4, we end up mistakenly
	 * thinking that the extra 2 u64 args are the augmented filename, so just check
	 * here and avoid using augmented syscalls when the evsel is the raw_syscalls one.
	 */
	if (evsel != trace->syscalls.events.sys_enter)
		augmented_args = syscall__augmented_args(sc, sample, &augmented_args_size, trace->raw_augmented_syscalls_args_size);

As the comment points out, we should not be trying to augment the args
for raw_syscalls. However, when processing a perf.data file, we are not
initializing those properly. Fix the same.

Reported-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lore.kernel.org/lkml/20220707090900.572584-1-naveen.n.rao@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
yghannam pushed a commit that referenced this pull request Jul 28, 2022
This reverts commit 912f655.

This commit introduced a regression that can cause mount hung.  The
changes in __ocfs2_find_empty_slot causes that any node with none-zero
node number can grab the slot that was already taken by node 0, so node 1
will access the same journal with node 0, when it try to grab journal
cluster lock, it will hung because it was already acquired by node 0. 
It's very easy to reproduce this, in one cluster, mount node 0 first, then
node 1, you will see the following call trace from node 1.

[13148.735424] INFO: task mount.ocfs2:53045 blocked for more than 122 seconds.
[13148.739691]       Not tainted 5.15.0-2148.0.4.el8uek.mountracev2.x86_64 #2
[13148.742560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[13148.745846] task:mount.ocfs2     state:D stack:    0 pid:53045 ppid: 53044 flags:0x00004000
[13148.749354] Call Trace:
[13148.750718]  <TASK>
[13148.752019]  ? usleep_range+0x90/0x89
[13148.753882]  __schedule+0x210/0x567
[13148.755684]  schedule+0x44/0xa8
[13148.757270]  schedule_timeout+0x106/0x13c
[13148.759273]  ? __prepare_to_swait+0x53/0x78
[13148.761218]  __wait_for_common+0xae/0x163
[13148.763144]  __ocfs2_cluster_lock.constprop.0+0x1d6/0x870 [ocfs2]
[13148.765780]  ? ocfs2_inode_lock_full_nested+0x18d/0x398 [ocfs2]
[13148.768312]  ocfs2_inode_lock_full_nested+0x18d/0x398 [ocfs2]
[13148.770968]  ocfs2_journal_init+0x91/0x340 [ocfs2]
[13148.773202]  ocfs2_check_volume+0x39/0x461 [ocfs2]
[13148.775401]  ? iput+0x69/0xba
[13148.777047]  ocfs2_mount_volume.isra.0.cold+0x40/0x1f5 [ocfs2]
[13148.779646]  ocfs2_fill_super+0x54b/0x853 [ocfs2]
[13148.781756]  mount_bdev+0x190/0x1b7
[13148.783443]  ? ocfs2_remount+0x440/0x440 [ocfs2]
[13148.785634]  legacy_get_tree+0x27/0x48
[13148.787466]  vfs_get_tree+0x25/0xd0
[13148.789270]  do_new_mount+0x18c/0x2d9
[13148.791046]  __x64_sys_mount+0x10e/0x142
[13148.792911]  do_syscall_64+0x3b/0x89
[13148.794667]  entry_SYSCALL_64_after_hwframe+0x170/0x0
[13148.797051] RIP: 0033:0x7f2309f6e26e
[13148.798784] RSP: 002b:00007ffdcee7d408 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[13148.801974] RAX: ffffffffffffffda RBX: 00007ffdcee7d4a0 RCX: 00007f2309f6e26e
[13148.804815] RDX: 0000559aa762a8ae RSI: 0000559aa939d340 RDI: 0000559aa93a22b0
[13148.807719] RBP: 00007ffdcee7d5b0 R08: 0000559aa93a2290 R09: 00007f230a0b4820
[13148.810659] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdcee7d420
[13148.813609] R13: 0000000000000000 R14: 0000559aa939f000 R15: 0000000000000000
[13148.816564]  </TASK>

To fix it, we can just fix __ocfs2_find_empty_slot.  But original commit
introduced the feature to mount ocfs2 locally even it is cluster based,
that is a very dangerous, it can easily cause serious data corruption,
there is no way to stop other nodes mounting the fs and corrupting it. 
Setup ha or other cluster-aware stack is just the cost that we have to
take for avoiding corruption, otherwise we have to do it in kernel.

Link: https://lkml.kernel.org/r/20220603222801.42488-1-junxiao.bi@oracle.com
Fixes: 912f655("ocfs2: mount shared volume without ha stack")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <heming.zhao@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ashkalra pushed a commit that referenced this pull request Dec 6, 2022
The behavior of the SEV-legacy commands is altered when the SNP firmware
is in the INIT state. When SNP is in INIT state, all the SEV-legacy
commands that cause the firmware to write to memory must be in the
firmware state before issuing the command..

A command buffer may contains a system physical address that the firmware
may write to. There are two cases that need to be handled:

1) system physical address points to a guest memory
2) system physical address points to a host memory

To handle the case #1, change the page state to the firmware in the RMP
table before issuing the command and restore the state to shared after the
command completes.

For the case #2, use a bounce buffer to complete the request.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
mdroth pushed a commit that referenced this pull request Dec 14, 2022
The behavior of the SEV-legacy commands is altered when the SNP firmware
is in the INIT state. When SNP is in INIT state, all the SEV-legacy
commands that cause the firmware to write to memory must be in the
firmware state before issuing the command..

A command buffer may contains a system physical address that the firmware
may write to. There are two cases that need to be handled:

1) system physical address points to a guest memory
2) system physical address points to a host memory

To handle the case #1, change the page state to the firmware in the RMP
table before issuing the command and restore the state to shared after the
command completes.

For the case #2, use a bounce buffer to complete the request.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
mdroth pushed a commit that referenced this pull request Feb 20, 2023
The behavior of the SEV-legacy commands is altered when the SNP firmware
is in the INIT state. When SNP is in INIT state, all the SEV-legacy
commands that cause the firmware to write to memory must be in the
firmware state before issuing the command..

A command buffer may contains a system physical address that the firmware
may write to. There are two cases that need to be handled:

1) system physical address points to a guest memory
2) system physical address points to a host memory

To handle the case #1, change the page state to the firmware in the RMP
table before issuing the command and restore the state to shared after the
command completes.

For the case #2, use a bounce buffer to complete the request.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
evtHsa pushed a commit that referenced this pull request Mar 9, 2023
We shouldn't be calling runtime PM APIs from within the genpd
enable/disable path for a couple reasons.

First, this causes an AA lockdep splat[1] because genpd can call into
genpd code again while holding the genpd lock.

WARNING: possible recursive locking detected
5.19.0-rc2-lockdep+ torvalds#7 Not tainted
--------------------------------------------
kworker/2:1/49 is trying to acquire lock:
ffffffeea0370788 (&genpd->mlock){+.+.}-{3:3}, at: genpd_lock_mtx+0x24/0x30

but task is already holding lock:
ffffffeea03710a8 (&genpd->mlock){+.+.}-{3:3}, at: genpd_lock_mtx+0x24/0x30

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&genpd->mlock);
  lock(&genpd->mlock);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

3 locks held by kworker/2:1/49:
 #0: 74ffff80811a5748 ((wq_completion)pm){+.+.}-{0:0}, at: process_one_work+0x320/0x5fc
 #1: ffffffc008537cf8 ((work_completion)(&genpd->power_off_work)){+.+.}-{0:0}, at: process_one_work+0x354/0x5fc
 #2: ffffffeea03710a8 (&genpd->mlock){+.+.}-{3:3}, at: genpd_lock_mtx+0x24/0x30

stack backtrace:
CPU: 2 PID: 49 Comm: kworker/2:1 Not tainted 5.19.0-rc2-lockdep+ torvalds#7
Hardware name: Google Lazor (rev3 - 8) with KB Backlight (DT)
Workqueue: pm genpd_power_off_work_fn
Call trace:
 dump_backtrace+0x1a0/0x200
 show_stack+0x24/0x30
 dump_stack_lvl+0x7c/0xa0
 dump_stack+0x18/0x44
 __lock_acquire+0xb38/0x3634
 lock_acquire+0x180/0x2d4
 __mutex_lock_common+0x118/0xe30
 mutex_lock_nested+0x70/0x7c
 genpd_lock_mtx+0x24/0x30
 genpd_runtime_suspend+0x2f0/0x414
 __rpm_callback+0xdc/0x1b8
 rpm_callback+0x4c/0xcc
 rpm_suspend+0x21c/0x5f0
 rpm_idle+0x17c/0x1e0
 __pm_runtime_idle+0x78/0xcc
 gdsc_disable+0x24c/0x26c
 _genpd_power_off+0xd4/0x1c4
 genpd_power_off+0x2d8/0x41c
 genpd_power_off_work_fn+0x60/0x94
 process_one_work+0x398/0x5fc
 worker_thread+0x42c/0x6c4
 kthread+0x194/0x1b4
 ret_from_fork+0x10/0x20

Second, this confuses runtime PM on CoachZ for the camera devices by
causing the camera clock controller's runtime PM usage_count to go
negative after resuming from suspend. This is because runtime PM is
being used on the clock controller while runtime PM is disabled for the
device.

The reason for the negative count is because a GDSC is represented as a
genpd and each genpd that is attached to a device is resumed during the
noirq phase of system wide suspend/resume (see the noirq suspend ops
assignment in pm_genpd_init() for more details). The camera GDSCs are
attached to camera devices with the 'power-domains' property in DT.
Every device has runtime PM disabled in the late system suspend phase
via __device_suspend_late(). Runtime PM is not usable until runtime PM
is enabled in device_resume_early(). The noirq phases run after the
'late' and before the 'early' phase of suspend/resume. When the genpds
are resumed in genpd_resume_noirq(), we call down into gdsc_enable()
that calls pm_runtime_resume_and_get() and that returns -EACCES to
indicate failure to resume because runtime PM is disabled for all
devices.

Upon closer inspection, calling runtime PM APIs like this in the GDSC
driver doesn't make sense. It was intended to make sure the GDSC for the
clock controller providing other GDSCs was enabled, specifically the
MMCX GDSC for the display clk controller on SM8250 (sm8250-dispcc), so
that GDSC register accesses succeeded. That will already happen because
we make the 'dev->pm_domain' a parent domain of each GDSC we register in
gdsc_register() via pm_genpd_add_subdomain(). When any of these GDSCs
are accessed, we'll enable the parent domain (in this specific case
MMCX).

We also remove any getting of runtime PM during registration, because
when a genpd is registered it increments the count on the parent if the
genpd itself is already enabled.

Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Johan Hovold <johan+linaro@kernel.org>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Taniya Das <quic_tdas@quicinc.com>
Cc: Satya Priya <quic_c_skakit@quicinc.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Reported-by: Stephen Boyd <swboyd@chromium.org>
Link: https://lore.kernel.org/r/CAE-0n52xbZeJ66RaKwggeRB57fUAwjvxGxfFMKOKJMKVyFTe+w@mail.gmail.com [1]
Fixes: 1b77183 ("clk: qcom: gdsc: enable optional power domain support")
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Link: https://lore.kernel.org/r/20221103183030.3594899-1-swboyd@chromium.org
Tested-by: Johan Hovold <johan+linaro@kernel.org>
Reviewed-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
evtHsa pushed a commit that referenced this pull request Mar 9, 2023
esw_attr is only allocated if namespace is fdb.

BUG: KASAN: slab-out-of-bounds in parse_tc_actions+0xdc6/0x10e0 [mlx5_core]
Write of size 4 at addr ffff88815f185b04 by task tc/2135

CPU: 5 PID: 2135 Comm: tc Not tainted 6.1.0-rc2+ #2
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x57/0x7d
 print_report+0x170/0x471
 ? parse_tc_actions+0xdc6/0x10e0 [mlx5_core]
 kasan_report+0xbc/0xf0
 ? parse_tc_actions+0xdc6/0x10e0 [mlx5_core]
 parse_tc_actions+0xdc6/0x10e0 [mlx5_core]

Fixes: 94d6517 ("net/mlx5e: TC, Fix cloned flow attr instance dests are not zeroed")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
haraldh pushed a commit to haraldh/linux that referenced this pull request Dec 13, 2024
[ Upstream commit f10a890 ]

syzbot reports deadlock issue of f2fs as below:

======================================================
WARNING: possible circular locking dependency detected
6.12.0-rc3-syzkaller-00087-gc964ced77262 #0 Not tainted
------------------------------------------------------
kswapd0/79 is trying to acquire lock:
ffff888011824088 (&sbi->sb_lock){++++}-{3:3}, at: f2fs_down_write fs/f2fs/f2fs.h:2199 [inline]
ffff888011824088 (&sbi->sb_lock){++++}-{3:3}, at: f2fs_record_stop_reason+0x52/0x1d0 fs/f2fs/super.c:4068

but task is already holding lock:
ffff88804bd92610 (sb_internal#2){.+.+}-{0:0}, at: f2fs_evict_inode+0x662/0x15c0 fs/f2fs/inode.c:842

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> AMDESE#2 (sb_internal#2){.+.+}-{0:0}:
       lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5825
       percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
       __sb_start_write include/linux/fs.h:1716 [inline]
       sb_start_intwrite+0x4d/0x1c0 include/linux/fs.h:1899
       f2fs_evict_inode+0x662/0x15c0 fs/f2fs/inode.c:842
       evict+0x4e8/0x9b0 fs/inode.c:725
       f2fs_evict_inode+0x1a4/0x15c0 fs/f2fs/inode.c:807
       evict+0x4e8/0x9b0 fs/inode.c:725
       dispose_list fs/inode.c:774 [inline]
       prune_icache_sb+0x239/0x2f0 fs/inode.c:963
       super_cache_scan+0x38c/0x4b0 fs/super.c:223
       do_shrink_slab+0x701/0x1160 mm/shrinker.c:435
       shrink_slab+0x1093/0x14d0 mm/shrinker.c:662
       shrink_one+0x43b/0x850 mm/vmscan.c:4818
       shrink_many mm/vmscan.c:4879 [inline]
       lru_gen_shrink_node mm/vmscan.c:4957 [inline]
       shrink_node+0x3799/0x3de0 mm/vmscan.c:5937
       kswapd_shrink_node mm/vmscan.c:6765 [inline]
       balance_pgdat mm/vmscan.c:6957 [inline]
       kswapd+0x1ca3/0x3700 mm/vmscan.c:7226
       kthread+0x2f0/0x390 kernel/kthread.c:389
       ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

-> AMDESE#1 (fs_reclaim){+.+.}-{0:0}:
       lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5825
       __fs_reclaim_acquire mm/page_alloc.c:3834 [inline]
       fs_reclaim_acquire+0x88/0x130 mm/page_alloc.c:3848
       might_alloc include/linux/sched/mm.h:318 [inline]
       prepare_alloc_pages+0x147/0x5b0 mm/page_alloc.c:4493
       __alloc_pages_noprof+0x16f/0x710 mm/page_alloc.c:4722
       alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265
       alloc_pages_noprof mm/mempolicy.c:2345 [inline]
       folio_alloc_noprof+0x128/0x180 mm/mempolicy.c:2352
       filemap_alloc_folio_noprof+0xdf/0x500 mm/filemap.c:1010
       do_read_cache_folio+0x2eb/0x850 mm/filemap.c:3787
       read_mapping_folio include/linux/pagemap.h:1011 [inline]
       f2fs_commit_super+0x3c0/0x7d0 fs/f2fs/super.c:4032
       f2fs_record_stop_reason+0x13b/0x1d0 fs/f2fs/super.c:4079
       f2fs_handle_critical_error+0x2ac/0x5c0 fs/f2fs/super.c:4174
       f2fs_write_inode+0x35f/0x4d0 fs/f2fs/inode.c:785
       write_inode fs/fs-writeback.c:1503 [inline]
       __writeback_single_inode+0x711/0x10d0 fs/fs-writeback.c:1723
       writeback_single_inode+0x1f3/0x660 fs/fs-writeback.c:1779
       sync_inode_metadata+0xc4/0x120 fs/fs-writeback.c:2849
       f2fs_release_file+0xa8/0x100 fs/f2fs/file.c:1941
       __fput+0x23f/0x880 fs/file_table.c:431
       task_work_run+0x24f/0x310 kernel/task_work.c:228
       resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:114 [inline]
       exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline]
       __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
       syscall_exit_to_user_mode+0x168/0x370 kernel/entry/common.c:218
       do_syscall_64+0x100/0x230 arch/x86/entry/common.c:89
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #0 (&sbi->sb_lock){++++}-{3:3}:
       check_prev_add kernel/locking/lockdep.c:3161 [inline]
       check_prevs_add kernel/locking/lockdep.c:3280 [inline]
       validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904
       __lock_acquire+0x1384/0x2050 kernel/locking/lockdep.c:5202
       lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5825
       down_write+0x99/0x220 kernel/locking/rwsem.c:1577
       f2fs_down_write fs/f2fs/f2fs.h:2199 [inline]
       f2fs_record_stop_reason+0x52/0x1d0 fs/f2fs/super.c:4068
       f2fs_handle_critical_error+0x2ac/0x5c0 fs/f2fs/super.c:4174
       f2fs_evict_inode+0xa61/0x15c0 fs/f2fs/inode.c:883
       evict+0x4e8/0x9b0 fs/inode.c:725
       f2fs_evict_inode+0x1a4/0x15c0 fs/f2fs/inode.c:807
       evict+0x4e8/0x9b0 fs/inode.c:725
       dispose_list fs/inode.c:774 [inline]
       prune_icache_sb+0x239/0x2f0 fs/inode.c:963
       super_cache_scan+0x38c/0x4b0 fs/super.c:223
       do_shrink_slab+0x701/0x1160 mm/shrinker.c:435
       shrink_slab+0x1093/0x14d0 mm/shrinker.c:662
       shrink_one+0x43b/0x850 mm/vmscan.c:4818
       shrink_many mm/vmscan.c:4879 [inline]
       lru_gen_shrink_node mm/vmscan.c:4957 [inline]
       shrink_node+0x3799/0x3de0 mm/vmscan.c:5937
       kswapd_shrink_node mm/vmscan.c:6765 [inline]
       balance_pgdat mm/vmscan.c:6957 [inline]
       kswapd+0x1ca3/0x3700 mm/vmscan.c:7226
       kthread+0x2f0/0x390 kernel/kthread.c:389
       ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

other info that might help us debug this:

Chain exists of:
  &sbi->sb_lock --> fs_reclaim --> sb_internal#2

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  rlock(sb_internal#2);
                               lock(fs_reclaim);
                               lock(sb_internal#2);
  lock(&sbi->sb_lock);

Root cause is there will be potential deadlock in between
below tasks:

Thread A				Kswapd
- f2fs_ioc_commit_atomic_write
 - mnt_want_write_file -- down_read lock A
					- balance_pgdat
					 - __fs_reclaim_acquire  -- lock B
					  - shrink_node
					   - prune_icache_sb
					    - dispose_list
					     - f2fs_evict_inode
					      - sb_start_intwrite  -- down_read lock A
 - f2fs_do_sync_file
  - f2fs_write_inode
   - f2fs_handle_critical_error
    - f2fs_record_stop_reason
     - f2fs_commit_super
      - read_mapping_folio
       - filemap_alloc_folio_noprof
        - fs_reclaim_acquire  -- lock B

Both threads try to acquire read lock of lock A, then its upcoming write
lock grabber will trigger deadlock.

Let's always create an asynchronous task in f2fs_handle_critical_error()
rather than calling f2fs_record_stop_reason() synchronously to avoid
this potential deadlock issue.

Fixes: b62e71b ("f2fs: support errors=remount-ro|continue|panic mountoption")
Reported-by: syzbot+be4a9983e95a5e25c8d3@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6704d667.050a0220.1e4d62.0081.GAE@google.com
Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Daejun Park <daejun7.park@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
MelodyHuibo pushed a commit that referenced this pull request Jan 20, 2025
Since the netlink attribute range validation provides inclusive
checking, the *max* of attribute NL80211_ATTR_MLO_LINK_ID should be
IEEE80211_MLD_MAX_NUM_LINKS - 1 otherwise causing an off-by-one.

One crash stack for demonstration:
==================================================================
BUG: KASAN: wild-memory-access in ieee80211_tx_control_port+0x3b6/0xca0 net/mac80211/tx.c:5939
Read of size 6 at addr 001102080000000c by task fuzzer.386/9508

CPU: 1 PID: 9508 Comm: syz.1.386 Not tainted 6.1.70 #2
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x177/0x231 lib/dump_stack.c:106
 print_report+0xe0/0x750 mm/kasan/report.c:398
 kasan_report+0x139/0x170 mm/kasan/report.c:495
 kasan_check_range+0x287/0x290 mm/kasan/generic.c:189
 memcpy+0x25/0x60 mm/kasan/shadow.c:65
 ieee80211_tx_control_port+0x3b6/0xca0 net/mac80211/tx.c:5939
 rdev_tx_control_port net/wireless/rdev-ops.h:761 [inline]
 nl80211_tx_control_port+0x7b3/0xc40 net/wireless/nl80211.c:15453
 genl_family_rcv_msg_doit+0x22e/0x320 net/netlink/genetlink.c:756
 genl_family_rcv_msg net/netlink/genetlink.c:833 [inline]
 genl_rcv_msg+0x539/0x740 net/netlink/genetlink.c:850
 netlink_rcv_skb+0x1de/0x420 net/netlink/af_netlink.c:2508
 genl_rcv+0x24/0x40 net/netlink/genetlink.c:861
 netlink_unicast_kernel net/netlink/af_netlink.c:1326 [inline]
 netlink_unicast+0x74b/0x8c0 net/netlink/af_netlink.c:1352
 netlink_sendmsg+0x882/0xb90 net/netlink/af_netlink.c:1874
 sock_sendmsg_nosec net/socket.c:716 [inline]
 __sock_sendmsg net/socket.c:728 [inline]
 ____sys_sendmsg+0x5cc/0x8f0 net/socket.c:2499
 ___sys_sendmsg+0x21c/0x290 net/socket.c:2553
 __sys_sendmsg net/socket.c:2582 [inline]
 __do_sys_sendmsg net/socket.c:2591 [inline]
 __se_sys_sendmsg+0x19e/0x270 net/socket.c:2589
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x45/0x90 arch/x86/entry/common.c:81
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Update the policy to ensure correct validation.

Fixes: 7b0a0e3 ("wifi: cfg80211: do some rework towards MLO link APIs")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Suggested-by: Cengiz Can <cengiz.can@canonical.com>
Link: https://patch.msgid.link/20241130170526.96698-1-linma@zju.edu.cn
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
MelodyHuibo pushed a commit that referenced this pull request Jan 20, 2025
Its used from trace__run(), for the 'perf trace' live mode, i.e. its
strace-like, non-perf.data file processing mode, the most common one.

The trace__run() function will set trace->host using machine__new_host()
that is supposed to give a machine instance representing the running
machine, and since we'll use perf_env__arch_strerrno() to get the right
errno -> string table, we need to use machine->env, so initialize it in
machine__new_host().

Before the patch:

  (gdb) run trace --errno-summary -a sleep 1
  <SNIP>
   Summary of events:

   gvfs-afc-volume (3187), 2 events, 0.0%

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     pselect6               1      0     0.000     0.000     0.000     0.000      0.00%

   GUsbEventThread (3519), 2 events, 0.0%

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     poll                   1      0     0.000     0.000     0.000     0.000      0.00%
  <SNIP>
  Program received signal SIGSEGV, Segmentation fault.
  0x00000000005caba0 in perf_env__arch_strerrno (env=0x0, err=110) at util/env.c:478
  478		if (env->arch_strerrno == NULL)
  (gdb) bt
  #0  0x00000000005caba0 in perf_env__arch_strerrno (env=0x0, err=110) at util/env.c:478
  #1  0x00000000004b75d2 in thread__dump_stats (ttrace=0x14f58f0, trace=0x7fffffffa5b0, fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>) at builtin-trace.c:4673
  #2  0x00000000004b78bf in trace__fprintf_thread (fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>, thread=0x10fa0b0, trace=0x7fffffffa5b0) at builtin-trace.c:4708
  #3  0x00000000004b7ad9 in trace__fprintf_thread_summary (trace=0x7fffffffa5b0, fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>) at builtin-trace.c:4747
  #4  0x00000000004b656e in trace__run (trace=0x7fffffffa5b0, argc=2, argv=0x7fffffffde60) at builtin-trace.c:4456
  #5  0x00000000004ba43e in cmd_trace (argc=2, argv=0x7fffffffde60) at builtin-trace.c:5487
  #6  0x00000000004c0414 in run_builtin (p=0xec3068 <commands+648>, argc=5, argv=0x7fffffffde60) at perf.c:351
  torvalds#7  0x00000000004c06bb in handle_internal_command (argc=5, argv=0x7fffffffde60) at perf.c:404
  torvalds#8  0x00000000004c0814 in run_argv (argcp=0x7fffffffdc4c, argv=0x7fffffffdc40) at perf.c:448
  torvalds#9  0x00000000004c0b5d in main (argc=5, argv=0x7fffffffde60) at perf.c:560
  (gdb)

After:

  root@number:~# perf trace -a --errno-summary sleep 1
  <SNIP>
     pw-data-loop (2685), 1410 events, 16.0%

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     epoll_wait           188      0   983.428     0.000     5.231    15.595      8.68%
     ioctl                 94      0     0.811     0.004     0.009     0.016      2.82%
     read                 188      0     0.322     0.001     0.002     0.006      5.15%
     write                141      0     0.280     0.001     0.002     0.018      8.39%
     timerfd_settime       94      0     0.138     0.001     0.001     0.007      6.47%

   gnome-control-c (179406), 1848 events, 20.9%

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     poll                 222      0   959.577     0.000     4.322    21.414     11.40%
     recvmsg              150      0     0.539     0.001     0.004     0.013      5.12%
     write                300      0     0.442     0.001     0.001     0.007      3.29%
     read                 150      0     0.183     0.001     0.001     0.009      5.53%
     getpid               102      0     0.101     0.000     0.001     0.008      7.82%

  root@number:~#

Fixes: 54373b5 ("perf env: Introduce perf_env__arch_strerrno()")
Reported-by: Veronika Molnarova <vmolnaro@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Veronika Molnarova <vmolnaro@redhat.com>
Acked-by: Michael Petlan <mpetlan@redhat.com>
Tested-by: Michael Petlan <mpetlan@redhat.com>
Link: https://lore.kernel.org/r/Z0XffUgNSv_9OjOi@x1
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
MelodyHuibo pushed a commit that referenced this pull request Jan 20, 2025
When virtnet_close is followed by virtnet_open, some TX completions can
possibly remain unconsumed, until they are finally processed during the
first NAPI poll after the netdev_tx_reset_queue(), resulting in a crash
[1]. Commit b96ed2c ("virtio_net: move netdev_tx_reset_queue() call
before RX napi enable") was not sufficient to eliminate all BQL crash
cases for virtio-net.

This issue can be reproduced with the latest net-next master by running:
`while :; do ip l set DEV down; ip l set DEV up; done` under heavy network
TX load from inside the machine.

netdev_tx_reset_queue() can actually be dropped from virtnet_open path;
the device is not stopped in any case. For BQL core part, it's just like
traffic nearly ceases to exist for some period. For stall detector added
to BQL, even if virtnet_close could somehow lead to some TX completions
delayed for long, followed by virtnet_open, we can just take it as stall
as mentioned in commit 6025b91 ("net: dqs: add NIC stall detector
based on BQL"). Note also that users can still reset stall_max via sysfs.

So, drop netdev_tx_reset_queue() from virtnet_enable_queue_pair(). This
eliminates the BQL crashes. As a result, netdev_tx_reset_queue() is now
explicitly required in freeze/restore path. This patch adds it to
immediately after free_unused_bufs(), following the rule of thumb:
netdev_tx_reset_queue() should follow any SKB freeing not followed by
netdev_tx_completed_queue(). This seems the most consistent and
streamlined approach, and now netdev_tx_reset_queue() runs whenever
free_unused_bufs() is done.

[1]:
------------[ cut here ]------------
kernel BUG at lib/dynamic_queue_limits.c:99!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 7 UID: 0 PID: 1598 Comm: ip Tainted: G    N 6.12.0net-next_main+ #2
Tainted: [N]=TEST
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), \
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
RIP: 0010:dql_completed+0x26b/0x290
Code: b7 c2 49 89 e9 44 89 da 89 c6 4c 89 d7 e8 ed 17 47 00 58 65 ff 0d
4d 27 90 7e 0f 85 fd fe ff ff e8 ea 53 8d ff e9 f3 fe ff ff <0f> 0b 01
d2 44 89 d1 29 d1 ba 00 00 00 00 0f 48 ca e9 28 ff ff ff
RSP: 0018:ffffc900002b0d08 EFLAGS: 00010297
RAX: 0000000000000000 RBX: ffff888102398c80 RCX: 0000000080190009
RDX: 0000000000000000 RSI: 000000000000006a RDI: 0000000000000000
RBP: ffff888102398c00 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000000000ca R11: 0000000000015681 R12: 0000000000000001
R13: ffffc900002b0d68 R14: ffff88811115e000 R15: ffff8881107aca40
FS:  00007f41ded69500(0000) GS:ffff888667dc0000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000556ccc2dc1a0 CR3: 0000000104fd8003 CR4: 0000000000772ef0
PKRU: 55555554
Call Trace:
 <IRQ>
 ? die+0x32/0x80
 ? do_trap+0xd9/0x100
 ? dql_completed+0x26b/0x290
 ? dql_completed+0x26b/0x290
 ? do_error_trap+0x6d/0xb0
 ? dql_completed+0x26b/0x290
 ? exc_invalid_op+0x4c/0x60
 ? dql_completed+0x26b/0x290
 ? asm_exc_invalid_op+0x16/0x20
 ? dql_completed+0x26b/0x290
 __free_old_xmit+0xff/0x170 [virtio_net]
 free_old_xmit+0x54/0xc0 [virtio_net]
 virtnet_poll+0xf4/0xe30 [virtio_net]
 ? __update_load_avg_cfs_rq+0x264/0x2d0
 ? update_curr+0x35/0x260
 ? reweight_entity+0x1be/0x260
 __napi_poll.constprop.0+0x28/0x1c0
 net_rx_action+0x329/0x420
 ? enqueue_hrtimer+0x35/0x90
 ? trace_hardirqs_on+0x1d/0x80
 ? kvm_sched_clock_read+0xd/0x20
 ? sched_clock+0xc/0x30
 ? kvm_sched_clock_read+0xd/0x20
 ? sched_clock+0xc/0x30
 ? sched_clock_cpu+0xd/0x1a0
 handle_softirqs+0x138/0x3e0
 do_softirq.part.0+0x89/0xc0
 </IRQ>
 <TASK>
 __local_bh_enable_ip+0xa7/0xb0
 virtnet_open+0xc8/0x310 [virtio_net]
 __dev_open+0xfa/0x1b0
 __dev_change_flags+0x1de/0x250
 dev_change_flags+0x22/0x60
 do_setlink.isra.0+0x2df/0x10b0
 ? rtnetlink_rcv_msg+0x34f/0x3f0
 ? netlink_rcv_skb+0x54/0x100
 ? netlink_unicast+0x23e/0x390
 ? netlink_sendmsg+0x21e/0x490
 ? ____sys_sendmsg+0x31b/0x350
 ? avc_has_perm_noaudit+0x67/0xf0
 ? cred_has_capability.isra.0+0x75/0x110
 ? __nla_validate_parse+0x5f/0xee0
 ? __pfx___probestub_irq_enable+0x3/0x10
 ? __create_object+0x5e/0x90
 ? security_capable+0x3b/0x70
 rtnl_newlink+0x784/0xaf0
 ? avc_has_perm_noaudit+0x67/0xf0
 ? cred_has_capability.isra.0+0x75/0x110
 ? stack_depot_save_flags+0x24/0x6d0
 ? __pfx_rtnl_newlink+0x10/0x10
 rtnetlink_rcv_msg+0x34f/0x3f0
 ? do_syscall_64+0x6c/0x180
 ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
 ? __pfx_rtnetlink_rcv_msg+0x10/0x10
 netlink_rcv_skb+0x54/0x100
 netlink_unicast+0x23e/0x390
 netlink_sendmsg+0x21e/0x490
 ____sys_sendmsg+0x31b/0x350
 ? copy_msghdr_from_user+0x6d/0xa0
 ___sys_sendmsg+0x86/0xd0
 ? __pte_offset_map+0x17/0x160
 ? preempt_count_add+0x69/0xa0
 ? __call_rcu_common.constprop.0+0x147/0x610
 ? preempt_count_add+0x69/0xa0
 ? preempt_count_add+0x69/0xa0
 ? _raw_spin_trylock+0x13/0x60
 ? trace_hardirqs_on+0x1d/0x80
 __sys_sendmsg+0x66/0xc0
 do_syscall_64+0x6c/0x180
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f41defe5b34
Code: 15 e1 12 0f 00 f7 d8 64 89 02 b8 ff ff ff ff eb bf 0f 1f 44 00 00
f3 0f 1e fa 80 3d 35 95 0f 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55
RSP: 002b:00007ffe5336ecc8 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f41defe5b34
RDX: 0000000000000000 RSI: 00007ffe5336ed30 RDI: 0000000000000003
RBP: 00007ffe5336eda0 R08: 0000000000000010 R09: 0000000000000001
R10: 00007ffe5336f6f9 R11: 0000000000000202 R12: 0000000000000003
R13: 0000000067452259 R14: 0000556ccc28b040 R15: 0000000000000000
 </TASK>
[...]

Fixes: c8bd1f7 ("virtio_net: add support for Byte Queue Limits")
Cc: <stable@vger.kernel.org> # v6.11+
Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
[ pabeni: trimmed possibly troublesome separator ]
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
MelodyHuibo pushed a commit that referenced this pull request Jan 20, 2025
…nts'

Koichiro Den says:

====================
virtio_net: correct netdev_tx_reset_queue() invocation points

When virtnet_close is followed by virtnet_open, some TX completions can
possibly remain unconsumed, until they are finally processed during the
first NAPI poll after the netdev_tx_reset_queue(), resulting in a crash
[1]. Commit b96ed2c ("virtio_net: move netdev_tx_reset_queue() call
before RX napi enable") was not sufficient to eliminate all BQL crash
scenarios for virtio-net.

This issue can be reproduced with the latest net-next master by running:
`while :; do ip l set DEV down; ip l set DEV up; done` under heavy network
TX load from inside the machine.

This patch series resolves the issue and also addresses similar existing
problems:

(a). Drop netdev_tx_reset_queue() from open/close path. This eliminates the
     BQL crashes due to the problematic open/close path.

(b). As a result of (a), netdev_tx_reset_queue() is now explicitly required
     in freeze/restore path. Add netdev_tx_reset_queue() immediately after
     free_unused_bufs() invocation.

(c). Fix missing resetting in virtnet_tx_resize().
     virtnet_tx_resize() has lacked proper resetting since commit
     c8bd1f7 ("virtio_net: add support for Byte Queue Limits").

(d). Fix missing resetting in the XDP_SETUP_XSK_POOL path.
     Similar to (c), this path lacked proper resetting. Call
     netdev_tx_reset_queue() when virtqueue_reset() has actually recycled
     unused buffers.

This patch series consists of six commits:
  [1/6]: Resolves (a) and (b).                      # also -stable 6.11.y
  [2/6]: Minor fix to make [4/6] streamlined.
  [3/6]: Prerequisite for (c).                      # also -stable 6.11.y
  [4/6]: Resolves (c) (incl. Prerequisite for (d))  # also -stable 6.11.y
  [5/6]: Preresuisite for (d).
  [6/6]: Resolves (d).

Changes for v4:
  - move netdev_tx_reset_queue() out of free_unused_bufs()
  - submit to net, not net-next
Changes for v3:
  - replace 'flushed' argument with 'recycle_done'
Changes for v2:
  - add tx queue resetting for (b) to (d) above

v3: https://lore.kernel.org/all/20241204050724.307544-1-koichiro.den@canonical.com/
v2: https://lore.kernel.org/all/20241203073025.67065-1-koichiro.den@canonical.com/
v1: https://lore.kernel.org/all/20241130181744.3772632-1-koichiro.den@canonical.com/

[1]:
------------[ cut here ]------------
kernel BUG at lib/dynamic_queue_limits.c:99!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 7 UID: 0 PID: 1598 Comm: ip Tainted: G    N 6.12.0net-next_main+ #2
Tainted: [N]=TEST
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), \
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
RIP: 0010:dql_completed+0x26b/0x290
Code: b7 c2 49 89 e9 44 89 da 89 c6 4c 89 d7 e8 ed 17 47 00 58 65 ff 0d
4d 27 90 7e 0f 85 fd fe ff ff e8 ea 53 8d ff e9 f3 fe ff ff <0f> 0b 01
d2 44 89 d1 29 d1 ba 00 00 00 00 0f 48 ca e9 28 ff ff ff
RSP: 0018:ffffc900002b0d08 EFLAGS: 00010297
RAX: 0000000000000000 RBX: ffff888102398c80 RCX: 0000000080190009
RDX: 0000000000000000 RSI: 000000000000006a RDI: 0000000000000000
RBP: ffff888102398c00 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000000000ca R11: 0000000000015681 R12: 0000000000000001
R13: ffffc900002b0d68 R14: ffff88811115e000 R15: ffff8881107aca40
FS:  00007f41ded69500(0000) GS:ffff888667dc0000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000556ccc2dc1a0 CR3: 0000000104fd8003 CR4: 0000000000772ef0
PKRU: 55555554
Call Trace:
 <IRQ>
 ? die+0x32/0x80
 ? do_trap+0xd9/0x100
 ? dql_completed+0x26b/0x290
 ? dql_completed+0x26b/0x290
 ? do_error_trap+0x6d/0xb0
 ? dql_completed+0x26b/0x290
 ? exc_invalid_op+0x4c/0x60
 ? dql_completed+0x26b/0x290
 ? asm_exc_invalid_op+0x16/0x20
 ? dql_completed+0x26b/0x290
 __free_old_xmit+0xff/0x170 [virtio_net]
 free_old_xmit+0x54/0xc0 [virtio_net]
 virtnet_poll+0xf4/0xe30 [virtio_net]
 ? __update_load_avg_cfs_rq+0x264/0x2d0
 ? update_curr+0x35/0x260
 ? reweight_entity+0x1be/0x260
 __napi_poll.constprop.0+0x28/0x1c0
 net_rx_action+0x329/0x420
 ? enqueue_hrtimer+0x35/0x90
 ? trace_hardirqs_on+0x1d/0x80
 ? kvm_sched_clock_read+0xd/0x20
 ? sched_clock+0xc/0x30
 ? kvm_sched_clock_read+0xd/0x20
 ? sched_clock+0xc/0x30
 ? sched_clock_cpu+0xd/0x1a0
 handle_softirqs+0x138/0x3e0
 do_softirq.part.0+0x89/0xc0
 </IRQ>
 <TASK>
 __local_bh_enable_ip+0xa7/0xb0
 virtnet_open+0xc8/0x310 [virtio_net]
 __dev_open+0xfa/0x1b0
 __dev_change_flags+0x1de/0x250
 dev_change_flags+0x22/0x60
 do_setlink.isra.0+0x2df/0x10b0
 ? rtnetlink_rcv_msg+0x34f/0x3f0
 ? netlink_rcv_skb+0x54/0x100
 ? netlink_unicast+0x23e/0x390
 ? netlink_sendmsg+0x21e/0x490
 ? ____sys_sendmsg+0x31b/0x350
 ? avc_has_perm_noaudit+0x67/0xf0
 ? cred_has_capability.isra.0+0x75/0x110
 ? __nla_validate_parse+0x5f/0xee0
 ? __pfx___probestub_irq_enable+0x3/0x10
 ? __create_object+0x5e/0x90
 ? security_capable+0x3b/0x7�[I0
 rtnl_newlink+0x784/0xaf0
 ? avc_has_perm_noaudit+0x67/0xf0
 ? cred_has_capability.isra.0+0x75/0x110
 ? stack_depot_save_flags+0x24/0x6d0
 ? __pfx_rtnl_newlink+0x10/0x10
 rtnetlink_rcv_msg+0x34f/0x3f0
 ? do_syscall_64+0x6c/0x180
 ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
 ? __pfx_rtnetlink_rcv_msg+0x10/0x10
 netlink_rcv_skb+0x54/0x100
 netlink_unicast+0x23e/0x390
 netlink_sendmsg+0x21e/0x490
 ____sys_sendmsg+0x31b/0x350
 ? copy_msghdr_from_user+0x6d/0xa0
 ___sys_sendmsg+0x86/0xd0
 ? __pte_offset_map+0x17/0x160
 ? preempt_count_add+0x69/0xa0
 ? __call_rcu_common.constprop.0+0x147/0x610
 ? preempt_count_add+0x69/0xa0
 ? preempt_count_add+0x69/0xa0
 ? _raw_spin_trylock+0x13/0x60
 ? trace_hardirqs_on+0x1d/0x80
 __sys_sendmsg+0x66/0xc0
 do_syscall_64+0x6c/0x180
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f41defe5b34
Code: 15 e1 12 0f 00 f7 d8 64 89 02 b8 ff ff ff ff eb bf 0f 1f 44 00 00
f3 0f 1e fa 80 3d 35 95 0f 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55
RSP: 002b:00007ffe5336ecc8 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f41defe5b34
RDX: 0000000000000000 RSI: 00007ffe5336ed30 RDI: 0000000000000003
RBP: 00007ffe5336eda0 R08: 0000000000000010 R09: 0000000000000001
R10: 00007ffe5336f6f9 R11: 0000000000000202 R12: 0000000000000003
R13: 0000000067452259 R14: 0000556ccc28b040 R15: 0000000000000000
 </TASK>
[...]
====================

Link: https://patch.msgid.link/20241206011047.923923-1-koichiro.den@canonical.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
MelodyHuibo pushed a commit that referenced this pull request Jan 20, 2025
…ux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.13, part #2

 - Fix confusion with implicitly-shifted MDCR_EL2 masks breaking
   SPE/TRBE initialization

 - Align nested page table walker with the intended memory attribute
   combining rules of the architecture

 - Prevent userspace from constraining the advertised ASID width,
   avoiding horrors of guest TLBIs not matching the intended context in
   hardware

 - Don't leak references on LPIs when insertion into the translation
   cache fails
MelodyHuibo pushed a commit that referenced this pull request Jan 20, 2025
The vmemmap's, which is used for RV64 with SPARSEMEM_VMEMMAP, page
tables are populated using pmd (page middle directory) hugetables.
However, the pmd allocation is not using the generic mechanism used by
the VMA code (e.g. pmd_alloc()), or the RISC-V specific
create_pgd_mapping()/alloc_pmd_late(). Instead, the vmemmap page table
code allocates a page, and calls vmemmap_set_pmd(). This results in
that the pmd ctor is *not* called, nor would it make sense to do so.

Now, when tearing down a vmemmap page table pmd, the cleanup code
would unconditionally, and incorrectly call the pmd dtor, which
results in a crash (best case).

This issue was found when running the HMM selftests:

  | tools/testing/selftests/mm# ./test_hmm.sh smoke
  | ... # when unloading the test_hmm.ko module
  | page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10915b
  | flags: 0x1000000000000000(node=0|zone=1)
  | raw: 1000000000000000 0000000000000000 dead000000000122 0000000000000000
  | raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
  | page dumped because: VM_BUG_ON_PAGE(ptdesc->pmd_huge_pte)
  | ------------[ cut here ]------------
  | kernel BUG at include/linux/mm.h:3080!
  | Kernel BUG [#1]
  | Modules linked in: test_hmm(-) sch_fq_codel fuse drm drm_panel_orientation_quirks backlight dm_mod
  | CPU: 1 UID: 0 PID: 514 Comm: modprobe Tainted: G        W          6.12.0-00982-gf2a4f1682d07 #2
  | Tainted: [W]=WARN
  | Hardware name: riscv-virtio qemu/qemu, BIOS 2024.10 10/01/2024
  | epc : remove_pgd_mapping+0xbec/0x1070
  |  ra : remove_pgd_mapping+0xbec/0x1070
  | epc : ffffffff80010a68 ra : ffffffff80010a68 sp : ff20000000a73940
  |  gp : ffffffff827b2d88 tp : ff6000008785da40 t0 : ffffffff80fbce04
  |  t1 : 0720072007200720 t2 : 706d756420656761 s0 : ff20000000a73a50
  |  s1 : ff6000008915cff8 a0 : 0000000000000039 a1 : 0000000000000008
  |  a2 : ff600003fff0de20 a3 : 0000000000000000 a4 : 0000000000000000
  |  a5 : 0000000000000000 a6 : c0000000ffffefff a7 : ffffffff824469b8
  |  s2 : ff1c0000022456c0 s3 : ff1ffffffdbfffff s4 : ff6000008915c000
  |  s5 : ff6000008915c000 s6 : ff6000008915c000 s7 : ff1ffffffdc00000
  |  s8 : 0000000000000001 s9 : ff1ffffffdc00000 s10: ffffffff819a31f0
  |  s11: ffffffffffffffff t3 : ffffffff8000c950 t4 : ff60000080244f00
  |  t5 : ff60000080244000 t6 : ff20000000a73708
  | status: 0000000200000120 badaddr: ffffffff80010a68 cause: 0000000000000003
  | [<ffffffff80010a68>] remove_pgd_mapping+0xbec/0x1070
  | [<ffffffff80fd238e>] vmemmap_free+0x14/0x1e
  | [<ffffffff8032e698>] section_deactivate+0x220/0x452
  | [<ffffffff8032ef7e>] sparse_remove_section+0x4a/0x58
  | [<ffffffff802f8700>] __remove_pages+0x7e/0xba
  | [<ffffffff803760d8>] memunmap_pages+0x2bc/0x3fe
  | [<ffffffff02a3ca28>] dmirror_device_remove_chunks+0x2ea/0x518 [test_hmm]
  | [<ffffffff02a3e026>] hmm_dmirror_exit+0x3e/0x1018 [test_hmm]
  | [<ffffffff80102c14>] __riscv_sys_delete_module+0x15a/0x2a6
  | [<ffffffff80fd020c>] do_trap_ecall_u+0x1f2/0x266
  | [<ffffffff80fde0a2>] _new_vmalloc_restore_context_a0+0xc6/0xd2
  | Code: bf51 7597 0184 8593 76a5 854a 4097 0029 80e7 2c00 (9002) 7597
  | ---[ end trace 0000000000000000 ]---
  | Kernel panic - not syncing: Fatal exception in interrupt

Add a check to avoid calling the pmd dtor, if the calling context is
vmemmap_free().

Fixes: c75a74f ("riscv: mm: Add memory hotplugging support")
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20241120131203.1859787-1-bjorn@kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
MelodyHuibo pushed a commit that referenced this pull request Jan 20, 2025
This reworks hci_cb_list to not use mutex hci_cb_list_lock to avoid bugs
like the bellow:

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:585
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 5070, name: kworker/u9:2
preempt_count: 0, expected: 0
RCU nest depth: 1, expected: 0
4 locks held by kworker/u9:2/5070:
 #0: ffff888015be3948 ((wq_completion)hci0#2){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3229 [inline]
 #0: ffff888015be3948 ((wq_completion)hci0#2){+.+.}-{0:0}, at: process_scheduled_works+0x8e0/0x1770 kernel/workqueue.c:3335
 #1: ffffc90003b6fd00 ((work_completion)(&hdev->rx_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3230 [inline]
 #1: ffffc90003b6fd00 ((work_completion)(&hdev->rx_work)){+.+.}-{0:0}, at: process_scheduled_works+0x91b/0x1770 kernel/workqueue.c:3335
 #2: ffff8880665d0078 (&hdev->lock){+.+.}-{3:3}, at: hci_le_create_big_complete_evt+0xcf/0xae0 net/bluetooth/hci_event.c:6914
 #3: ffffffff8e132020 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:298 [inline]
 #3: ffffffff8e132020 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:750 [inline]
 #3: ffffffff8e132020 (rcu_read_lock){....}-{1:2}, at: hci_le_create_big_complete_evt+0xdb/0xae0 net/bluetooth/hci_event.c:6915
CPU: 0 PID: 5070 Comm: kworker/u9:2 Not tainted 6.8.0-syzkaller-08073-g480e035fc4c7 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
Workqueue: hci0 hci_rx_work
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
 __might_resched+0x5d4/0x780 kernel/sched/core.c:10187
 __mutex_lock_common kernel/locking/mutex.c:585 [inline]
 __mutex_lock+0xc1/0xd70 kernel/locking/mutex.c:752
 hci_connect_cfm include/net/bluetooth/hci_core.h:2004 [inline]
 hci_le_create_big_complete_evt+0x3d9/0xae0 net/bluetooth/hci_event.c:6939
 hci_event_func net/bluetooth/hci_event.c:7514 [inline]
 hci_event_packet+0xa53/0x1540 net/bluetooth/hci_event.c:7569
 hci_rx_work+0x3e8/0xca0 net/bluetooth/hci_core.c:4171
 process_one_work kernel/workqueue.c:3254 [inline]
 process_scheduled_works+0xa00/0x1770 kernel/workqueue.c:3335
 worker_thread+0x86d/0xd70 kernel/workqueue.c:3416
 kthread+0x2f0/0x390 kernel/kthread.c:388
 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243
 </TASK>

Reported-by: syzbot+2fb0835e0c9cefc34614@syzkaller.appspotmail.com
Tested-by: syzbot+2fb0835e0c9cefc34614@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=2fb0835e0c9cefc34614
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
MelodyHuibo pushed a commit that referenced this pull request Jan 20, 2025
This fixes the circular locking dependency warning below, by
releasing the socket lock before enterning iso_listen_bis, to
avoid any potential deadlock with hdev lock.

[   75.307983] ======================================================
[   75.307984] WARNING: possible circular locking dependency detected
[   75.307985] 6.12.0-rc6+ torvalds#22 Not tainted
[   75.307987] ------------------------------------------------------
[   75.307987] kworker/u81:2/2623 is trying to acquire lock:
[   75.307988] ffff8fde1769da58 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO)
               at: iso_connect_cfm+0x253/0x840 [bluetooth]
[   75.308021]
               but task is already holding lock:
[   75.308022] ffff8fdd61a10078 (&hdev->lock)
               at: hci_le_per_adv_report_evt+0x47/0x2f0 [bluetooth]
[   75.308053]
               which lock already depends on the new lock.

[   75.308054]
               the existing dependency chain (in reverse order) is:
[   75.308055]
               -> #1 (&hdev->lock){+.+.}-{3:3}:
[   75.308057]        __mutex_lock+0xad/0xc50
[   75.308061]        mutex_lock_nested+0x1b/0x30
[   75.308063]        iso_sock_listen+0x143/0x5c0 [bluetooth]
[   75.308085]        __sys_listen_socket+0x49/0x60
[   75.308088]        __x64_sys_listen+0x4c/0x90
[   75.308090]        x64_sys_call+0x2517/0x25f0
[   75.308092]        do_syscall_64+0x87/0x150
[   75.308095]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   75.308098]
               -> #0 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO){+.+.}-{0:0}:
[   75.308100]        __lock_acquire+0x155e/0x25f0
[   75.308103]        lock_acquire+0xc9/0x300
[   75.308105]        lock_sock_nested+0x32/0x90
[   75.308107]        iso_connect_cfm+0x253/0x840 [bluetooth]
[   75.308128]        hci_connect_cfm+0x6c/0x190 [bluetooth]
[   75.308155]        hci_le_per_adv_report_evt+0x27b/0x2f0 [bluetooth]
[   75.308180]        hci_le_meta_evt+0xe7/0x200 [bluetooth]
[   75.308206]        hci_event_packet+0x21f/0x5c0 [bluetooth]
[   75.308230]        hci_rx_work+0x3ae/0xb10 [bluetooth]
[   75.308254]        process_one_work+0x212/0x740
[   75.308256]        worker_thread+0x1bd/0x3a0
[   75.308258]        kthread+0xe4/0x120
[   75.308259]        ret_from_fork+0x44/0x70
[   75.308261]        ret_from_fork_asm+0x1a/0x30
[   75.308263]
               other info that might help us debug this:

[   75.308264]  Possible unsafe locking scenario:

[   75.308264]        CPU0                CPU1
[   75.308265]        ----                ----
[   75.308265]   lock(&hdev->lock);
[   75.308267]                            lock(sk_lock-
                                                AF_BLUETOOTH-BTPROTO_ISO);
[   75.308268]                            lock(&hdev->lock);
[   75.308269]   lock(sk_lock-AF_BLUETOOTH-BTPROTO_ISO);
[   75.308270]
                *** DEADLOCK ***

[   75.308271] 4 locks held by kworker/u81:2/2623:
[   75.308272]  #0: ffff8fdd66e52148 ((wq_completion)hci0#2){+.+.}-{0:0},
                at: process_one_work+0x443/0x740
[   75.308276]  #1: ffffafb488b7fe48 ((work_completion)(&hdev->rx_work)),
                at: process_one_work+0x1ce/0x740
[   75.308280]  #2: ffff8fdd61a10078 (&hdev->lock){+.+.}-{3:3}
                at: hci_le_per_adv_report_evt+0x47/0x2f0 [bluetooth]
[   75.308304]  #3: ffffffffb6ba4900 (rcu_read_lock){....}-{1:2},
                at: hci_connect_cfm+0x29/0x190 [bluetooth]

Fixes: 02171da ("Bluetooth: ISO: Add hcon for listening bis sk")
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
MelodyHuibo pushed a commit that referenced this pull request Jan 20, 2025
This fixes the circular locking dependency warning below, by reworking
iso_sock_recvmsg, to ensure that the socket lock is always released
before calling a function that locks hdev.

[  561.670344] ======================================================
[  561.670346] WARNING: possible circular locking dependency detected
[  561.670349] 6.12.0-rc6+ torvalds#26 Not tainted
[  561.670351] ------------------------------------------------------
[  561.670353] iso-tester/3289 is trying to acquire lock:
[  561.670355] ffff88811f600078 (&hdev->lock){+.+.}-{3:3},
               at: iso_conn_big_sync+0x73/0x260 [bluetooth]
[  561.670405]
               but task is already holding lock:
[  561.670407] ffff88815af58258 (sk_lock-AF_BLUETOOTH){+.+.}-{0:0},
               at: iso_sock_recvmsg+0xbf/0x500 [bluetooth]
[  561.670450]
               which lock already depends on the new lock.

[  561.670452]
               the existing dependency chain (in reverse order) is:
[  561.670453]
               -> #2 (sk_lock-AF_BLUETOOTH){+.+.}-{0:0}:
[  561.670458]        lock_acquire+0x7c/0xc0
[  561.670463]        lock_sock_nested+0x3b/0xf0
[  561.670467]        bt_accept_dequeue+0x1a5/0x4d0 [bluetooth]
[  561.670510]        iso_sock_accept+0x271/0x830 [bluetooth]
[  561.670547]        do_accept+0x3dd/0x610
[  561.670550]        __sys_accept4+0xd8/0x170
[  561.670553]        __x64_sys_accept+0x74/0xc0
[  561.670556]        x64_sys_call+0x17d6/0x25f0
[  561.670559]        do_syscall_64+0x87/0x150
[  561.670563]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  561.670567]
               -> #1 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO){+.+.}-{0:0}:
[  561.670571]        lock_acquire+0x7c/0xc0
[  561.670574]        lock_sock_nested+0x3b/0xf0
[  561.670577]        iso_sock_listen+0x2de/0xf30 [bluetooth]
[  561.670617]        __sys_listen_socket+0xef/0x130
[  561.670620]        __x64_sys_listen+0xe1/0x190
[  561.670623]        x64_sys_call+0x2517/0x25f0
[  561.670626]        do_syscall_64+0x87/0x150
[  561.670629]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  561.670632]
               -> #0 (&hdev->lock){+.+.}-{3:3}:
[  561.670636]        __lock_acquire+0x32ad/0x6ab0
[  561.670639]        lock_acquire.part.0+0x118/0x360
[  561.670642]        lock_acquire+0x7c/0xc0
[  561.670644]        __mutex_lock+0x18d/0x12f0
[  561.670647]        mutex_lock_nested+0x1b/0x30
[  561.670651]        iso_conn_big_sync+0x73/0x260 [bluetooth]
[  561.670687]        iso_sock_recvmsg+0x3e9/0x500 [bluetooth]
[  561.670722]        sock_recvmsg+0x1d5/0x240
[  561.670725]        sock_read_iter+0x27d/0x470
[  561.670727]        vfs_read+0x9a0/0xd30
[  561.670731]        ksys_read+0x1a8/0x250
[  561.670733]        __x64_sys_read+0x72/0xc0
[  561.670736]        x64_sys_call+0x1b12/0x25f0
[  561.670738]        do_syscall_64+0x87/0x150
[  561.670741]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  561.670744]
               other info that might help us debug this:

[  561.670745] Chain exists of:
&hdev->lock --> sk_lock-AF_BLUETOOTH-BTPROTO_ISO --> sk_lock-AF_BLUETOOTH

[  561.670751]  Possible unsafe locking scenario:

[  561.670753]        CPU0                    CPU1
[  561.670754]        ----                    ----
[  561.670756]   lock(sk_lock-AF_BLUETOOTH);
[  561.670758]                                lock(sk_lock
                                              AF_BLUETOOTH-BTPROTO_ISO);
[  561.670761]                                lock(sk_lock-AF_BLUETOOTH);
[  561.670764]   lock(&hdev->lock);
[  561.670767]
                *** DEADLOCK ***

Fixes: 07a9342 ("Bluetooth: ISO: Send BIG Create Sync via hci_sync")
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
MelodyHuibo pushed a commit that referenced this pull request Jan 20, 2025
Aishwarya reports that warnings are sometimes seen when running the
ftrace kselftests, e.g.

| WARNING: CPU: 5 PID: 2066 at arch/arm64/kernel/stacktrace.c:141 arch_stack_walk+0x4a0/0x4c0
| Modules linked in:
| CPU: 5 UID: 0 PID: 2066 Comm: ftracetest Not tainted 6.13.0-rc2 #2
| Hardware name: linux,dummy-virt (DT)
| pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : arch_stack_walk+0x4a0/0x4c0
| lr : arch_stack_walk+0x248/0x4c0
| sp : ffff800083643d20
| x29: ffff800083643dd0 x28: ffff00007b891400 x27: ffff00007b891928
| x26: 0000000000000001 x25: 00000000000000c0 x24: ffff800082f39d80
| x23: ffff80008003ee8c x22: ffff80008004baa8 x21: ffff8000800533e0
| x20: ffff800083643e10 x19: ffff80008003eec8 x18: 0000000000000000
| x17: 0000000000000000 x16: ffff800083640000 x15: 0000000000000000
| x14: 02a37a802bbb8a92 x13: 00000000000001a9 x12: 0000000000000001
| x11: ffff800082ffad60 x10: ffff800083643d20 x9 : ffff80008003eed0
| x8 : ffff80008004baa8 x7 : ffff800086f2be80 x6 : ffff0000057cf000
| x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff800086f2b690
| x2 : ffff80008004baa8 x1 : ffff80008004baa8 x0 : ffff80008004baa8
| Call trace:
|  arch_stack_walk+0x4a0/0x4c0 (P)
|  arch_stack_walk+0x248/0x4c0 (L)
|  profile_pc+0x44/0x80
|  profile_tick+0x50/0x80 (F)
|  tick_nohz_handler+0xcc/0x160 (F)
|  __hrtimer_run_queues+0x2ac/0x340 (F)
|  hrtimer_interrupt+0xf4/0x268 (F)
|  arch_timer_handler_virt+0x34/0x60 (F)
|  handle_percpu_devid_irq+0x88/0x220 (F)
|  generic_handle_domain_irq+0x34/0x60 (F)
|  gic_handle_irq+0x54/0x140 (F)
|  call_on_irq_stack+0x24/0x58 (F)
|  do_interrupt_handler+0x88/0x98
|  el1_interrupt+0x34/0x68 (F)
|  el1h_64_irq_handler+0x18/0x28
|  el1h_64_irq+0x6c/0x70
|  queued_spin_lock_slowpath+0x78/0x460 (P)

The warning in question is:

  WARN_ON_ONCE(state->common.pc == orig_pc))

... in kunwind_recover_return_address(), which is triggered when
return_to_handler() is encountered in the trace, but
ftrace_graph_ret_addr() cannot find a corresponding original return
address on the fgraph return stack.

This happens because the stacktrace code encounters an exception
boundary where the LR was not live at the time of the exception, but the
LR happens to contain return_to_handler(); either because the task
recently returned there, or due to unfortunate usage of the LR at a
scratch register. In such cases attempts to recover the return address
via ftrace_graph_ret_addr() may fail, triggering the WARN_ON_ONCE()
above and aborting the unwind (hence the stacktrace terminating after
reporting the PC at the time of the exception).

Handling unreliable LR values in these cases is likely to require some
larger rework, so for the moment avoid this problem by restoring the old
behaviour of skipping the LR at exception boundaries, which the
stacktrace code did prior to commit:

  c2c6b27 ("arm64: stacktrace: unwind exception boundaries")

This commit is effectively a partial revert, keeping the structures and
logic to explicitly identify exception boundaries while still skipping
reporting of the LR. The logic to explicitly identify exception
boundaries is still useful for general robustness and as a building
block for future support for RELIABLE_STACKTRACE.

Fixes: c2c6b27 ("arm64: stacktrace: unwind exception boundaries")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reported-by: Aishwarya TCV <aishwarya.tcv@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241211140704.2498712-2-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
MelodyHuibo pushed a commit that referenced this pull request Jan 20, 2025
The arm64 stacktrace code has a few error conditions where a
WARN_ON_ONCE() is triggered before the stacktrace is terminated and an
error is returned to the caller. The conditions shouldn't be triggered
when unwinding the current task, but it is possible to trigger these
when unwinding another task which is not blocked, as the stack of that
task is concurrently modified. Kent reports that these warnings can be
triggered while running filesystem tests on bcachefs, which calls the
stacktrace code directly.

To produce a meaningful stacktrace of another task, the task in question
should be blocked, but the stacktrace code is expected to be robust to
cases where it is not blocked. Note that this is purely about not
unuduly scaring the user and/or crashing the kernel; stacktraces in such
cases are meaningless and may leak kernel secrets from the stack of the
task being unwound.

Ideally we'd pin the task in a blocked state during the unwind, as we do
for /proc/${PID}/wchan since commit:

  42a20f8 ("sched: Add wrapper for get_wchan() to keep task blocked")

... but a bunch of places don't do that, notably /proc/${PID}/stack,
where we don't pin the task in a blocked state, but do restrict the
output to privileged users since commit:

  f8a00ce ("proc: restrict kernel stack dumps to root")

... and so it's possible to trigger these warnings accidentally, e.g. by
reading /proc/*/stack (as root):

| for n in $(seq 1 10); do
|     while true; do cat /proc/*/stack > /dev/null 2>&1; done &
| done
| ------------[ cut here ]------------
| WARNING: CPU: 3 PID: 166 at arch/arm64/kernel/stacktrace.c:207 arch_stack_walk+0x1c8/0x370
| Modules linked in:
| CPU: 3 UID: 0 PID: 166 Comm: cat Not tainted 6.13.0-rc2-00003-g3dafa7a7925d #2
| Hardware name: linux,dummy-virt (DT)
| pstate: 81400005 (Nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
| pc : arch_stack_walk+0x1c8/0x370
| lr : arch_stack_walk+0x1b0/0x370
| sp : ffff800080773890
| x29: ffff800080773930 x28: fff0000005c44500 x27: fff00000058fa038
| x26: 000000007ffff000 x25: 0000000000000000 x24: 0000000000000000
| x23: ffffa35a8d9600ec x22: 0000000000000000 x21: fff00000043a33c0
| x20: ffff800080773970 x19: ffffa35a8d960168 x18: 0000000000000000
| x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
| x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
| x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
| x8 : ffff8000807738e0 x7 : ffff8000806e3800 x6 : ffff8000806e3818
| x5 : ffff800080773920 x4 : ffff8000806e4000 x3 : ffff8000807738e0
| x2 : 0000000000000018 x1 : ffff8000806e3800 x0 : 0000000000000000
| Call trace:
|  arch_stack_walk+0x1c8/0x370 (P)
|  stack_trace_save_tsk+0x8c/0x108
|  proc_pid_stack+0xb0/0x134
|  proc_single_show+0x60/0x120
|  seq_read_iter+0x104/0x438
|  seq_read+0xf8/0x140
|  vfs_read+0xc4/0x31c
|  ksys_read+0x70/0x108
|  __arm64_sys_read+0x1c/0x28
|  invoke_syscall+0x48/0x104
|  el0_svc_common.constprop.0+0x40/0xe0
|  do_el0_svc+0x1c/0x28
|  el0_svc+0x30/0xcc
|  el0t_64_sync_handler+0x10c/0x138
|  el0t_64_sync+0x198/0x19c
| ---[ end trace 0000000000000000 ]---

Fix this by only warning when unwinding the current task. When unwinding
another task the error conditions will be handled by returning an error
without producing a warning.

The two warnings in kunwind_next_frame_record_meta() were added recently
as part of commit:

  c2c6b27 ("arm64: stacktrace: unwind exception boundaries")

The warning when recovering the fgraph return address has changed form
many times, but was originally introduced back in commit:

  9f41631 ("arm64: fix unwind_frame() for filtered out fn for function graph tracing")

Fixes: c2c6b27 ("arm64: stacktrace: unwind exception boundaries")
Fixes: 9f41631 ("arm64: fix unwind_frame() for filtered out fn for function graph tracing")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reported-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241211140704.2498712-3-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
MelodyHuibo pushed a commit that referenced this pull request Jan 20, 2025
The current implementation removes cache tags after disabling ATS,
leading to potential memory leaks and kernel crashes. Specifically,
CACHE_TAG_DEVTLB type cache tags may still remain in the list even
after the domain is freed, causing a use-after-free condition.

This issue really shows up when multiple VFs from different PFs
passed through to a single user-space process via vfio-pci. In such
cases, the kernel may crash with kernel messages like:

 BUG: kernel NULL pointer dereference, address: 0000000000000014
 PGD 19036a067 P4D 1940a3067 PUD 136c9b067 PMD 0
 Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
 CPU: 74 UID: 0 PID: 3183 Comm: testCli Not tainted 6.11.9 #2
 RIP: 0010:cache_tag_flush_range+0x9b/0x250
 Call Trace:
  <TASK>
  ? __die+0x1f/0x60
  ? page_fault_oops+0x163/0x590
  ? exc_page_fault+0x72/0x190
  ? asm_exc_page_fault+0x22/0x30
  ? cache_tag_flush_range+0x9b/0x250
  ? cache_tag_flush_range+0x5d/0x250
  intel_iommu_tlb_sync+0x29/0x40
  intel_iommu_unmap_pages+0xfe/0x160
  __iommu_unmap+0xd8/0x1a0
  vfio_unmap_unpin+0x182/0x340 [vfio_iommu_type1]
  vfio_remove_dma+0x2a/0xb0 [vfio_iommu_type1]
  vfio_iommu_type1_ioctl+0xafa/0x18e0 [vfio_iommu_type1]

Move cache_tag_unassign_domain() before iommu_disable_pci_caps() to fix
it.

Fixes: 3b1d9e2 ("iommu/vt-d: Add cache tag assignment interface")
Cc: stable@vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20241129020506.576413-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
MelodyHuibo pushed a commit that referenced this pull request Jan 20, 2025
…s_lock

For storing a value to a queue attribute, the queue_attr_store function
first freezes the queue (->q_usage_counter(io)) and then acquire
->sysfs_lock. This seems not correct as the usual ordering should be to
acquire ->sysfs_lock before freezing the queue. This incorrect ordering
causes the following lockdep splat which we are able to reproduce always
simply by accessing /sys/kernel/debug file using ls command:

[   57.597146] WARNING: possible circular locking dependency detected
[   57.597154] 6.12.0-10553-gb86545e02e8c torvalds#20 Tainted: G        W
[   57.597162] ------------------------------------------------------
[   57.597168] ls/4605 is trying to acquire lock:
[   57.597176] c00000003eb56710 (&mm->mmap_lock){++++}-{4:4}, at: __might_fault+0x58/0xc0
[   57.597200]
               but task is already holding lock:
[   57.597207] c0000018e27c6810 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: iterate_dir+0x94/0x1d4
[   57.597226]
               which lock already depends on the new lock.

[   57.597233]
               the existing dependency chain (in reverse order) is:
[   57.597241]
               -> #5 (&sb->s_type->i_mutex_key#3){++++}-{4:4}:
[   57.597255]        down_write+0x6c/0x18c
[   57.597264]        start_creating+0xb4/0x24c
[   57.597274]        debugfs_create_dir+0x2c/0x1e8
[   57.597283]        blk_register_queue+0xec/0x294
[   57.597292]        add_disk_fwnode+0x2e4/0x548
[   57.597302]        brd_alloc+0x2c8/0x338
[   57.597309]        brd_init+0x100/0x178
[   57.597317]        do_one_initcall+0x88/0x3e4
[   57.597326]        kernel_init_freeable+0x3cc/0x6e0
[   57.597334]        kernel_init+0x34/0x1cc
[   57.597342]        ret_from_kernel_user_thread+0x14/0x1c
[   57.597350]
               -> #4 (&q->debugfs_mutex){+.+.}-{4:4}:
[   57.597362]        __mutex_lock+0xfc/0x12a0
[   57.597370]        blk_register_queue+0xd4/0x294
[   57.597379]        add_disk_fwnode+0x2e4/0x548
[   57.597388]        brd_alloc+0x2c8/0x338
[   57.597395]        brd_init+0x100/0x178
[   57.597402]        do_one_initcall+0x88/0x3e4
[   57.597410]        kernel_init_freeable+0x3cc/0x6e0
[   57.597418]        kernel_init+0x34/0x1cc
[   57.597426]        ret_from_kernel_user_thread+0x14/0x1c
[   57.597434]
               -> #3 (&q->sysfs_lock){+.+.}-{4:4}:
[   57.597446]        __mutex_lock+0xfc/0x12a0
[   57.597454]        queue_attr_store+0x9c/0x110
[   57.597462]        sysfs_kf_write+0x70/0xb0
[   57.597471]        kernfs_fop_write_iter+0x1b0/0x2ac
[   57.597480]        vfs_write+0x3dc/0x6e8
[   57.597488]        ksys_write+0x84/0x140
[   57.597495]        system_call_exception+0x130/0x360
[   57.597504]        system_call_common+0x160/0x2c4
[   57.597516]
               -> #2 (&q->q_usage_counter(io)torvalds#21){++++}-{0:0}:
[   57.597530]        __submit_bio+0x5ec/0x828
[   57.597538]        submit_bio_noacct_nocheck+0x1e4/0x4f0
[   57.597547]        iomap_readahead+0x2a0/0x448
[   57.597556]        xfs_vm_readahead+0x28/0x3c
[   57.597564]        read_pages+0x88/0x41c
[   57.597571]        page_cache_ra_unbounded+0x1ac/0x2d8
[   57.597580]        filemap_get_pages+0x188/0x984
[   57.597588]        filemap_read+0x13c/0x4bc
[   57.597596]        xfs_file_buffered_read+0x88/0x17c
[   57.597605]        xfs_file_read_iter+0xac/0x158
[   57.597614]        vfs_read+0x2d4/0x3b4
[   57.597622]        ksys_read+0x84/0x144
[   57.597629]        system_call_exception+0x130/0x360
[   57.597637]        system_call_common+0x160/0x2c4
[   57.597647]
               -> #1 (mapping.invalidate_lock#2){++++}-{4:4}:
[   57.597661]        down_read+0x6c/0x220
[   57.597669]        filemap_fault+0x870/0x100c
[   57.597677]        xfs_filemap_fault+0xc4/0x18c
[   57.597684]        __do_fault+0x64/0x164
[   57.597693]        __handle_mm_fault+0x1274/0x1dac
[   57.597702]        handle_mm_fault+0x248/0x484
[   57.597711]        ___do_page_fault+0x428/0xc0c
[   57.597719]        hash__do_page_fault+0x30/0x68
[   57.597727]        do_hash_fault+0x90/0x35c
[   57.597736]        data_access_common_virt+0x210/0x220
[   57.597745]        _copy_from_user+0xf8/0x19c
[   57.597754]        sel_write_load+0x178/0xd54
[   57.597762]        vfs_write+0x108/0x6e8
[   57.597769]        ksys_write+0x84/0x140
[   57.597777]        system_call_exception+0x130/0x360
[   57.597785]        system_call_common+0x160/0x2c4
[   57.597794]
               -> #0 (&mm->mmap_lock){++++}-{4:4}:
[   57.597806]        __lock_acquire+0x17cc/0x2330
[   57.597814]        lock_acquire+0x138/0x400
[   57.597822]        __might_fault+0x7c/0xc0
[   57.597830]        filldir64+0xe8/0x390
[   57.597839]        dcache_readdir+0x80/0x2d4
[   57.597846]        iterate_dir+0xd8/0x1d4
[   57.597855]        sys_getdents64+0x88/0x2d4
[   57.597864]        system_call_exception+0x130/0x360
[   57.597872]        system_call_common+0x160/0x2c4
[   57.597881]
               other info that might help us debug this:

[   57.597888] Chain exists of:
                 &mm->mmap_lock --> &q->debugfs_mutex --> &sb->s_type->i_mutex_key#3

[   57.597905]  Possible unsafe locking scenario:

[   57.597911]        CPU0                    CPU1
[   57.597917]        ----                    ----
[   57.597922]   rlock(&sb->s_type->i_mutex_key#3);
[   57.597932]                                lock(&q->debugfs_mutex);
[   57.597940]                                lock(&sb->s_type->i_mutex_key#3);
[   57.597950]   rlock(&mm->mmap_lock);
[   57.597958]
                *** DEADLOCK ***

[   57.597965] 2 locks held by ls/4605:
[   57.597971]  #0: c0000000137c12f8 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0xcc/0x154
[   57.597989]  #1: c0000018e27c6810 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: iterate_dir+0x94/0x1d4

Prevent the above lockdep warning by acquiring ->sysfs_lock before
freezing the queue while storing a queue attribute in queue_attr_store
function. Later, we also found[1] another function __blk_mq_update_nr_
hw_queues where we first freeze queue and then acquire the ->sysfs_lock.
So we've also updated lock ordering in __blk_mq_update_nr_hw_queues
function and ensured that in all code paths we follow the correct lock
ordering i.e. acquire ->sysfs_lock before freezing the queue.

[1] https://lore.kernel.org/all/CAFj5m9Ke8+EHKQBs_Nk6hqd=LGXtk4mUxZUN5==ZcCjnZSBwHw@mail.gmail.com/

Reported-by: kjain@linux.ibm.com
Fixes: af28141 ("block: freeze the queue in queue_attr_store")
Tested-by: kjain@linux.ibm.com
Cc: hch@lst.de
Cc: axboe@kernel.dk
Cc: ritesh.list@gmail.com
Cc: ming.lei@redhat.com
Cc: gjoyce@linux.ibm.com
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241210144222.1066229-1-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
MelodyHuibo pushed a commit that referenced this pull request Jan 20, 2025
Guangguan Wang says:

====================
net: several fixes for smc

v1 -> v2:
rewrite patch #2 suggested by Paolo.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
MelodyHuibo pushed a commit that referenced this pull request Jan 20, 2025
The mapping VMA address is saved in VAS window struct when the
paste address is mapped. This VMA address is used during migration
to unmap the paste address if the window is active. The paste
address mapping will be removed when the window is closed or with
the munmap(). But the VMA address in the VAS window is not updated
with munmap() which is causing invalid access during migration.

The KASAN report shows:
[16386.254991] BUG: KASAN: slab-use-after-free in reconfig_close_windows+0x1a0/0x4e8
[16386.255043] Read of size 8 at addr c00000014a819670 by task drmgr/696928

[16386.255096] CPU: 29 UID: 0 PID: 696928 Comm: drmgr Kdump: loaded Tainted: G    B              6.11.0-rc5-nxgzip #2
[16386.255128] Tainted: [B]=BAD_PAGE
[16386.255148] Hardware name: IBM,9080-HEX Power11 (architected) 0x820200 0xf000007 of:IBM,FW1110.00 (NH1110_016) hv:phyp pSeries
[16386.255181] Call Trace:
[16386.255202] [c00000016b297660] [c0000000018ad0ac] dump_stack_lvl+0x84/0xe8 (unreliable)
[16386.255246] [c00000016b297690] [c0000000006e8a90] print_report+0x19c/0x764
[16386.255285] [c00000016b297760] [c0000000006e9490] kasan_report+0x128/0x1f8
[16386.255309] [c00000016b297880] [c0000000006eb5c8] __asan_load8+0xac/0xe0
[16386.255326] [c00000016b2978a0] [c00000000013f898] reconfig_close_windows+0x1a0/0x4e8
[16386.255343] [c00000016b297990] [c000000000140e58] vas_migration_handler+0x3a4/0x3fc
[16386.255368] [c00000016b297a90] [c000000000128848] pseries_migrate_partition+0x4c/0x4c4
...

[16386.256136] Allocated by task 696554 on cpu 31 at 16377.277618s:
[16386.256149]  kasan_save_stack+0x34/0x68
[16386.256163]  kasan_save_track+0x34/0x80
[16386.256175]  kasan_save_alloc_info+0x58/0x74
[16386.256196]  __kasan_slab_alloc+0xb8/0xdc
[16386.256209]  kmem_cache_alloc_noprof+0x200/0x3d0
[16386.256225]  vm_area_alloc+0x44/0x150
[16386.256245]  mmap_region+0x214/0x10c4
[16386.256265]  do_mmap+0x5fc/0x750
[16386.256277]  vm_mmap_pgoff+0x14c/0x24c
[16386.256292]  ksys_mmap_pgoff+0x20c/0x348
[16386.256303]  sys_mmap+0xd0/0x160
...

[16386.256350] Freed by task 0 on cpu 31 at 16386.204848s:
[16386.256363]  kasan_save_stack+0x34/0x68
[16386.256374]  kasan_save_track+0x34/0x80
[16386.256384]  kasan_save_free_info+0x64/0x10c
[16386.256396]  __kasan_slab_free+0x120/0x204
[16386.256415]  kmem_cache_free+0x128/0x450
[16386.256428]  vm_area_free_rcu_cb+0xa8/0xd8
[16386.256441]  rcu_do_batch+0x2c8/0xcf0
[16386.256458]  rcu_core+0x378/0x3c4
[16386.256473]  handle_softirqs+0x20c/0x60c
[16386.256495]  do_softirq_own_stack+0x6c/0x88
[16386.256509]  do_softirq_own_stack+0x58/0x88
[16386.256521]  __irq_exit_rcu+0x1a4/0x20c
[16386.256533]  irq_exit+0x20/0x38
[16386.256544]  interrupt_async_exit_prepare.constprop.0+0x18/0x2c
...

[16386.256717] Last potentially related work creation:
[16386.256729]  kasan_save_stack+0x34/0x68
[16386.256741]  __kasan_record_aux_stack+0xcc/0x12c
[16386.256753]  __call_rcu_common.constprop.0+0x94/0xd04
[16386.256766]  vm_area_free+0x28/0x3c
[16386.256778]  remove_vma+0xf4/0x114
[16386.256797]  do_vmi_align_munmap.constprop.0+0x684/0x870
[16386.256811]  __vm_munmap+0xe0/0x1f8
[16386.256821]  sys_munmap+0x54/0x6c
[16386.256830]  system_call_exception+0x1a0/0x4a0
[16386.256841]  system_call_vectored_common+0x15c/0x2ec

[16386.256868] The buggy address belongs to the object at c00000014a819670
                which belongs to the cache vm_area_struct of size 168
[16386.256887] The buggy address is located 0 bytes inside of
                freed 168-byte region [c00000014a819670, c00000014a819718)

[16386.256915] The buggy address belongs to the physical page:
[16386.256928] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x14a81
[16386.256950] memcg:c0000000ba430001
[16386.256961] anon flags: 0x43ffff800000000(node=4|zone=0|lastcpupid=0x7ffff)
[16386.256975] page_type: 0xfdffffff(slab)
[16386.256990] raw: 043ffff800000000 c00000000501c080 0000000000000000 5deadbee00000001
[16386.257003] raw: 0000000000000000 00000000011a011a 00000001fdffffff c0000000ba430001
[16386.257018] page dumped because: kasan: bad access detected

This patch adds close() callback in vas_vm_ops vm_operations_struct
which will be executed during munmap() before freeing VMA. The VMA
address in the VAS window is set to NULL after holding the window
mmap_mutex.

Fixes: 37e6764 ("powerpc/pseries/vas: Add VAS migration handler")
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20241214051758.997759-1-haren@linux.ibm.com
MelodyHuibo pushed a commit that referenced this pull request Jan 20, 2025
Access to genmask field in struct nft_set_ext results in unaligned
atomic read:

[   72.130109] Unable to handle kernel paging request at virtual address ffff0000c2bb708c
[   72.131036] Mem abort info:
[   72.131213]   ESR = 0x0000000096000021
[   72.131446]   EC = 0x25: DABT (current EL), IL = 32 bits
[   72.132209]   SET = 0, FnV = 0
[   72.133216]   EA = 0, S1PTW = 0
[   72.134080]   FSC = 0x21: alignment fault
[   72.135593] Data abort info:
[   72.137194]   ISV = 0, ISS = 0x00000021, ISS2 = 0x00000000
[   72.142351]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[   72.145989]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[   72.150115] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000237d27000
[   72.154893] [ffff0000c2bb708c] pgd=0000000000000000, p4d=180000023ffff403, pud=180000023f84b403, pmd=180000023f835403,
+pte=0068000102bb7707
[   72.163021] Internal error: Oops: 0000000096000021 [#1] SMP
[...]
[   72.170041] CPU: 7 UID: 0 PID: 54 Comm: kworker/7:0 Tainted: G            E      6.13.0-rc3+ #2
[   72.170509] Tainted: [E]=UNSIGNED_MODULE
[   72.170720] Hardware name: QEMU QEMU Virtual Machine, BIOS edk2-stable202302-for-qemu 03/01/2023
[   72.171192] Workqueue: events_power_efficient nft_rhash_gc [nf_tables]
[   72.171552] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[   72.171915] pc : nft_rhash_gc+0x200/0x2d8 [nf_tables]
[   72.172166] lr : nft_rhash_gc+0x128/0x2d8 [nf_tables]
[   72.172546] sp : ffff800081f2bce0
[   72.172724] x29: ffff800081f2bd40 x28: ffff0000c2bb708c x27: 0000000000000038
[   72.173078] x26: ffff0000c6780ef0 x25: ffff0000c643df00 x24: ffff0000c6778f78
[   72.173431] x23: 000000000000001a x22: ffff0000c4b1f000 x21: ffff0000c6780f78
[   72.173782] x20: ffff0000c2bb70dc x19: ffff0000c2bb7080 x18: 0000000000000000
[   72.174135] x17: ffff0000c0a4e1c0 x16: 0000000000003000 x15: 0000ac26d173b978
[   72.174485] x14: ffffffffffffffff x13: 0000000000000030 x12: ffff0000c6780ef0
[   72.174841] x11: 0000000000000000 x10: ffff800081f2bcf8 x9 : ffff0000c3000000
[   72.175193] x8 : 00000000000004be x7 : 0000000000000000 x6 : 0000000000000000
[   72.175544] x5 : 0000000000000040 x4 : ffff0000c3000010 x3 : 0000000000000000
[   72.175871] x2 : 0000000000003a98 x1 : ffff0000c2bb708c x0 : 0000000000000004
[   72.176207] Call trace:
[   72.176316]  nft_rhash_gc+0x200/0x2d8 [nf_tables] (P)
[   72.176653]  process_one_work+0x178/0x3d0
[   72.176831]  worker_thread+0x200/0x3f0
[   72.176995]  kthread+0xe8/0xf8
[   72.177130]  ret_from_fork+0x10/0x20
[   72.177289] Code: 54fff984 d503201f d2800080 91003261 (f820303f)
[   72.177557] ---[ end trace 0000000000000000 ]---

Align struct nft_set_ext to word size to address this and
documentation it.

pahole reports that this increases the size of elements for rhash and
pipapo in 8 bytes on x86_64.

Fixes: 7ffc748 ("netfilter: nft_set_hash: skip duplicated elements pending gc run")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
MelodyHuibo pushed a commit that referenced this pull request Jan 20, 2025
syzbot reports that a recent fix causes nesting issues between the (now)
raw timeoutlock and the eventfd locking:

=============================
[ BUG: Invalid wait context ]
6.13.0-rc4-00080-g9828a4c0901f torvalds#29 Not tainted
-----------------------------
kworker/u32:0/68094 is trying to lock:
ffff000014d7a520 (&ctx->wqh#2){..-.}-{3:3}, at: eventfd_signal_mask+0x64/0x180
other info that might help us debug this:
context-{5:5}
6 locks held by kworker/u32:0/68094:
 #0: ffff0000c1d98148 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work+0x4e8/0xfc0
 #1: ffff80008d927c78 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work+0x53c/0xfc0
 #2: ffff0000c59bc3d8 (&ctx->completion_lock){+.+.}-{3:3}, at: io_kill_timeouts+0x40/0x180
 #3: ffff0000c59bc358 (&ctx->timeout_lock){-.-.}-{2:2}, at: io_kill_timeouts+0x48/0x180
 #4: ffff800085127aa0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x8/0x38
 #5: ffff800085127aa0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x8/0x38
stack backtrace:
CPU: 7 UID: 0 PID: 68094 Comm: kworker/u32:0 Not tainted 6.13.0-rc4-00080-g9828a4c0901f torvalds#29
Hardware name: linux,dummy-virt (DT)
Workqueue: iou_exit io_ring_exit_work
Call trace:
 show_stack+0x1c/0x30 (C)
 __dump_stack+0x24/0x30
 dump_stack_lvl+0x60/0x80
 dump_stack+0x14/0x20
 __lock_acquire+0x19f8/0x60c8
 lock_acquire+0x1a4/0x540
 _raw_spin_lock_irqsave+0x90/0xd0
 eventfd_signal_mask+0x64/0x180
 io_eventfd_signal+0x64/0x108
 io_req_local_work_add+0x294/0x430
 __io_req_task_work_add+0x1c0/0x270
 io_kill_timeout+0x1f0/0x288
 io_kill_timeouts+0xd4/0x180
 io_uring_try_cancel_requests+0x2e8/0x388
 io_ring_exit_work+0x150/0x550
 process_one_work+0x5e8/0xfc0
 worker_thread+0x7ec/0xc80
 kthread+0x24c/0x300
 ret_from_fork+0x10/0x20

because after the preempt-rt fix for the timeout lock nesting inside
the io-wq lock, we now have the eventfd spinlock nesting inside the
raw timeout spinlock.

Rather than play whack-a-mole with other nesting on the timeout lock,
split the deletion and killing of timeouts so queueing the task_work
for the timeout cancelations can get done outside of the timeout lock.

Reported-by: syzbot+b1fc199a40b65d601b65@syzkaller.appspotmail.com
Fixes: 020b40f ("io_uring: make ctx->timeout_lock a raw spinlock")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
MelodyHuibo pushed a commit that referenced this pull request Jan 20, 2025
…le_direct_reclaim()

The task sometimes continues looping in throttle_direct_reclaim() because
allow_direct_reclaim(pgdat) keeps returning false.  

 #0 [ffff80002cb6f8d0] __switch_to at ffff8000080095ac
 #1 [ffff80002cb6f900] __schedule at ffff800008abbd1c
 #2 [ffff80002cb6f990] schedule at ffff800008abc50c
 #3 [ffff80002cb6f9b0] throttle_direct_reclaim at ffff800008273550
 #4 [ffff80002cb6fa20] try_to_free_pages at ffff800008277b68
 #5 [ffff80002cb6fae0] __alloc_pages_nodemask at ffff8000082c4660
 #6 [ffff80002cb6fc50] alloc_pages_vma at ffff8000082e4a98
 torvalds#7 [ffff80002cb6fca0] do_anonymous_page at ffff80000829f5a8
 torvalds#8 [ffff80002cb6fce0] __handle_mm_fault at ffff8000082a5974
 torvalds#9 [ffff80002cb6fd90] handle_mm_fault at ffff8000082a5bd4

At this point, the pgdat contains the following two zones:

        NODE: 4  ZONE: 0  ADDR: ffff00817fffe540  NAME: "DMA32"
          SIZE: 20480  MIN/LOW/HIGH: 11/28/45
          VM_STAT:
                NR_FREE_PAGES: 359
        NR_ZONE_INACTIVE_ANON: 18813
          NR_ZONE_ACTIVE_ANON: 0
        NR_ZONE_INACTIVE_FILE: 50
          NR_ZONE_ACTIVE_FILE: 0
          NR_ZONE_UNEVICTABLE: 0
        NR_ZONE_WRITE_PENDING: 0
                     NR_MLOCK: 0
                    NR_BOUNCE: 0
                   NR_ZSPAGES: 0
            NR_FREE_CMA_PAGES: 0

        NODE: 4  ZONE: 1  ADDR: ffff00817fffec00  NAME: "Normal"
          SIZE: 8454144  PRESENT: 98304  MIN/LOW/HIGH: 68/166/264
          VM_STAT:
                NR_FREE_PAGES: 146
        NR_ZONE_INACTIVE_ANON: 94668
          NR_ZONE_ACTIVE_ANON: 3
        NR_ZONE_INACTIVE_FILE: 735
          NR_ZONE_ACTIVE_FILE: 78
          NR_ZONE_UNEVICTABLE: 0
        NR_ZONE_WRITE_PENDING: 0
                     NR_MLOCK: 0
                    NR_BOUNCE: 0
                   NR_ZSPAGES: 0
            NR_FREE_CMA_PAGES: 0

In allow_direct_reclaim(), while processing ZONE_DMA32, the sum of
inactive/active file-backed pages calculated in zone_reclaimable_pages()
based on the result of zone_page_state_snapshot() is zero.  

Additionally, since this system lacks swap, the calculation of inactive/
active anonymous pages is skipped.

        crash> p nr_swap_pages
        nr_swap_pages = $1937 = {
          counter = 0
        }

As a result, ZONE_DMA32 is deemed unreclaimable and skipped, moving on to
the processing of the next zone, ZONE_NORMAL, despite ZONE_DMA32 having
free pages significantly exceeding the high watermark.

The problem is that the pgdat->kswapd_failures hasn't been incremented.

        crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_failures
        $1935 = 0x0

This is because the node deemed balanced.  The node balancing logic in
balance_pgdat() evaluates all zones collectively.  If one or more zones
(e.g., ZONE_DMA32) have enough free pages to meet their watermarks, the
entire node is deemed balanced.  This causes balance_pgdat() to exit early
before incrementing the kswapd_failures, as it considers the overall
memory state acceptable, even though some zones (like ZONE_NORMAL) remain
under significant pressure.


The patch ensures that zone_reclaimable_pages() includes free pages
(NR_FREE_PAGES) in its calculation when no other reclaimable pages are
available (e.g., file-backed or anonymous pages).  This change prevents
zones like ZONE_DMA32, which have sufficient free pages, from being
mistakenly deemed unreclaimable.  By doing so, the patch ensures proper
node balancing, avoids masking pressure on other zones like ZONE_NORMAL,
and prevents infinite loops in throttle_direct_reclaim() caused by
allow_direct_reclaim(pgdat) repeatedly returning false.


The kernel hangs due to a task stuck in throttle_direct_reclaim(), caused
by a node being incorrectly deemed balanced despite pressure in certain
zones, such as ZONE_NORMAL.  This issue arises from
zone_reclaimable_pages() returning 0 for zones without reclaimable file-
backed or anonymous pages, causing zones like ZONE_DMA32 with sufficient
free pages to be skipped.

The lack of swap or reclaimable pages results in ZONE_DMA32 being ignored
during reclaim, masking pressure in other zones.  Consequently,
pgdat->kswapd_failures remains 0 in balance_pgdat(), preventing fallback
mechanisms in allow_direct_reclaim() from being triggered, leading to an
infinite loop in throttle_direct_reclaim().

This patch modifies zone_reclaimable_pages() to account for free pages
(NR_FREE_PAGES) when no other reclaimable pages exist.  This ensures zones
with sufficient free pages are not skipped, enabling proper balancing and
reclaim behavior.

[akpm@linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/20241130164346.436469-1-snishika@redhat.com
Link: https://lkml.kernel.org/r/20241130161236.433747-2-snishika@redhat.com
Fixes: 5a1c84b ("mm: remove reclaim and compaction retry approximations")
Signed-off-by: Seiji Nishikawa <snishika@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
MelodyHuibo pushed a commit that referenced this pull request Jan 20, 2025
We found a timeout problem with the pldm command on our system.  The
reason is that the MCTP-I3C driver has a race condition when receiving
multiple-packet messages in multi-thread, resulting in a wrong packet
order problem.

We identified this problem by adding a debug message to the
mctp_i3c_read function.

According to the MCTP spec, a multiple-packet message must be composed
in sequence, and if there is a wrong sequence, the whole message will be
discarded and wait for the next SOM.
For example, SOM → Pkt Seq #2 → Pkt Seq #1 → Pkt Seq #3 → EOM.

Therefore, we try to solve this problem by adding a mutex to the
mctp_i3c_read function.  Before the modification, when a command
requesting a multiple-packet message response is sent consecutively, an
error usually occurs within 100 loops.  After the mutex, it can go
through 40000 loops without any error, and it seems to run well.

Fixes: c8755b2 ("mctp i3c: MCTP I3C driver")
Signed-off-by: Leo Yang <Leo-Yang@quantatw.com>
Link: https://patch.msgid.link/20250107031529.3296094-1-Leo-Yang@quantatw.com
[pabeni@redhat.com: dropped already answered question from changelog]
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
mdroth pushed a commit that referenced this pull request Mar 19, 2025
irq_chip functions may be called in raw spinlock context. Therefore, we
must also use a raw spinlock for our own internal locking.

This fixes the following lockdep splat:

[    5.349336] =============================
[    5.353349] [ BUG: Invalid wait context ]
[    5.357361] 6.13.0-rc5+ torvalds#69 Tainted: G        W
[    5.363031] -----------------------------
[    5.367045] kworker/u17:1/44 is trying to lock:
[    5.371587] ffffff88018b02c0 (&chip->gpio_lock){....}-{3:3}, at: xgpio_irq_unmask (drivers/gpio/gpio-xilinx.c:433 (discriminator 8))
[    5.380079] other info that might help us debug this:
[    5.385138] context-{5:5}
[    5.387762] 5 locks held by kworker/u17:1/44:
[    5.392123] #0: ffffff8800014958 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work (kernel/workqueue.c:3204)
[    5.402260] #1: ffffffc082fcbdd8 (deferred_probe_work){+.+.}-{0:0}, at: process_one_work (kernel/workqueue.c:3205)
[    5.411528] #2: ffffff880172c900 (&dev->mutex){....}-{4:4}, at: __device_attach (drivers/base/dd.c:1006)
[    5.419929] #3: ffffff88039c8268 (request_class#2){+.+.}-{4:4}, at: __setup_irq (kernel/irq/internals.h:156 kernel/irq/manage.c:1596)
[    5.428331] #4: ffffff88039c80c8 (lock_class#2){....}-{2:2}, at: __setup_irq (kernel/irq/manage.c:1614)
[    5.436472] stack backtrace:
[    5.439359] CPU: 2 UID: 0 PID: 44 Comm: kworker/u17:1 Tainted: G        W          6.13.0-rc5+ torvalds#69
[    5.448690] Tainted: [W]=WARN
[    5.451656] Hardware name: xlnx,zynqmp (DT)
[    5.455845] Workqueue: events_unbound deferred_probe_work_func
[    5.461699] Call trace:
[    5.464147] show_stack+0x18/0x24 C
[    5.467821] dump_stack_lvl (lib/dump_stack.c:123)
[    5.471501] dump_stack (lib/dump_stack.c:130)
[    5.474824] __lock_acquire (kernel/locking/lockdep.c:4828 kernel/locking/lockdep.c:4898 kernel/locking/lockdep.c:5176)
[    5.478758] lock_acquire (arch/arm64/include/asm/percpu.h:40 kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5851 kernel/locking/lockdep.c:5814)
[    5.482429] _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:111 kernel/locking/spinlock.c:162)
[    5.486797] xgpio_irq_unmask (drivers/gpio/gpio-xilinx.c:433 (discriminator 8))
[    5.490737] irq_enable (kernel/irq/internals.h:236 kernel/irq/chip.c:170 kernel/irq/chip.c:439 kernel/irq/chip.c:432 kernel/irq/chip.c:345)
[    5.494060] __irq_startup (kernel/irq/internals.h:241 kernel/irq/chip.c:180 kernel/irq/chip.c:250)
[    5.497645] irq_startup (kernel/irq/chip.c:270)
[    5.501143] __setup_irq (kernel/irq/manage.c:1807)
[    5.504728] request_threaded_irq (kernel/irq/manage.c:2208)

Fixes: a32c7ca ("gpio: gpio-xilinx: Add interrupt support")
Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250110163354.2012654-1-sean.anderson@linux.dev
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
mdroth pushed a commit that referenced this pull request Mar 19, 2025
This commit addresses a circular locking dependency issue within the GFX
isolation mechanism. The problem was identified by a warning indicating
a potential deadlock due to inconsistent lock acquisition order.

- The `amdgpu_gfx_enforce_isolation_ring_begin_use` and
  `amdgpu_gfx_enforce_isolation_ring_end_use` functions previously
  acquired `enforce_isolation_mutex` and called `amdgpu_gfx_kfd_sch_ctrl`,
  leading to potential deadlocks. ie., If `amdgpu_gfx_kfd_sch_ctrl` is
  called while `enforce_isolation_mutex` is held, and
  `amdgpu_gfx_enforce_isolation_handler` is called while `kfd_sch_mutex` is
  held, it can create a circular dependency.

By ensuring consistent lock usage, this fix resolves the issue:

[  606.297333] ======================================================
[  606.297343] WARNING: possible circular locking dependency detected
[  606.297353] 6.10.0-amd-mlkd-610-311224-lof torvalds#19 Tainted: G           OE
[  606.297365] ------------------------------------------------------
[  606.297375] kworker/u96:3/3825 is trying to acquire lock:
[  606.297385] ffff9aa64e431cb8 ((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)){+.+.}-{0:0}, at: __flush_work+0x232/0x610
[  606.297413]
               but task is already holding lock:
[  606.297423] ffff9aa64e432338 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}, at: amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu]
[  606.297725]
               which lock already depends on the new lock.

[  606.297738]
               the existing dependency chain (in reverse order) is:
[  606.297749]
               -> #2 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}:
[  606.297765]        __mutex_lock+0x85/0x930
[  606.297776]        mutex_lock_nested+0x1b/0x30
[  606.297786]        amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu]
[  606.298007]        amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu]
[  606.298225]        amdgpu_ring_alloc+0x48/0x70 [amdgpu]
[  606.298412]        amdgpu_ib_schedule+0x176/0x8a0 [amdgpu]
[  606.298603]        amdgpu_job_run+0xac/0x1e0 [amdgpu]
[  606.298866]        drm_sched_run_job_work+0x24f/0x430 [gpu_sched]
[  606.298880]        process_one_work+0x21e/0x680
[  606.298890]        worker_thread+0x190/0x350
[  606.298899]        kthread+0xe7/0x120
[  606.298908]        ret_from_fork+0x3c/0x60
[  606.298919]        ret_from_fork_asm+0x1a/0x30
[  606.298929]
               -> #1 (&adev->enforce_isolation_mutex){+.+.}-{3:3}:
[  606.298947]        __mutex_lock+0x85/0x930
[  606.298956]        mutex_lock_nested+0x1b/0x30
[  606.298966]        amdgpu_gfx_enforce_isolation_handler+0x87/0x370 [amdgpu]
[  606.299190]        process_one_work+0x21e/0x680
[  606.299199]        worker_thread+0x190/0x350
[  606.299208]        kthread+0xe7/0x120
[  606.299217]        ret_from_fork+0x3c/0x60
[  606.299227]        ret_from_fork_asm+0x1a/0x30
[  606.299236]
               -> #0 ((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)){+.+.}-{0:0}:
[  606.299257]        __lock_acquire+0x16f9/0x2810
[  606.299267]        lock_acquire+0xd1/0x300
[  606.299276]        __flush_work+0x250/0x610
[  606.299286]        cancel_delayed_work_sync+0x71/0x80
[  606.299296]        amdgpu_gfx_kfd_sch_ctrl+0x287/0x4d0 [amdgpu]
[  606.299509]        amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu]
[  606.299723]        amdgpu_ring_alloc+0x48/0x70 [amdgpu]
[  606.299909]        amdgpu_ib_schedule+0x176/0x8a0 [amdgpu]
[  606.300101]        amdgpu_job_run+0xac/0x1e0 [amdgpu]
[  606.300355]        drm_sched_run_job_work+0x24f/0x430 [gpu_sched]
[  606.300369]        process_one_work+0x21e/0x680
[  606.300378]        worker_thread+0x190/0x350
[  606.300387]        kthread+0xe7/0x120
[  606.300396]        ret_from_fork+0x3c/0x60
[  606.300406]        ret_from_fork_asm+0x1a/0x30
[  606.300416]
               other info that might help us debug this:

[  606.300428] Chain exists of:
                 (work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work) --> &adev->enforce_isolation_mutex --> &adev->gfx.kfd_sch_mutex

[  606.300458]  Possible unsafe locking scenario:

[  606.300468]        CPU0                    CPU1
[  606.300476]        ----                    ----
[  606.300484]   lock(&adev->gfx.kfd_sch_mutex);
[  606.300494]                                lock(&adev->enforce_isolation_mutex);
[  606.300508]                                lock(&adev->gfx.kfd_sch_mutex);
[  606.300521]   lock((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work));
[  606.300536]
                *** DEADLOCK ***

[  606.300546] 5 locks held by kworker/u96:3/3825:
[  606.300555]  #0: ffff9aa5aa1f5d58 ((wq_completion)comp_1.1.0){+.+.}-{0:0}, at: process_one_work+0x3f5/0x680
[  606.300577]  #1: ffffaa53c3c97e40 ((work_completion)(&sched->work_run_job)){+.+.}-{0:0}, at: process_one_work+0x1d6/0x680
[  606.300600]  #2: ffff9aa64e463c98 (&adev->enforce_isolation_mutex){+.+.}-{3:3}, at: amdgpu_gfx_enforce_isolation_ring_begin_use+0x1c3/0x5d0 [amdgpu]
[  606.300837]  #3: ffff9aa64e432338 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}, at: amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu]
[  606.301062]  #4: ffffffff8c1a5660 (rcu_read_lock){....}-{1:2}, at: __flush_work+0x70/0x610
[  606.301083]
               stack backtrace:
[  606.301092] CPU: 14 PID: 3825 Comm: kworker/u96:3 Tainted: G           OE      6.10.0-amd-mlkd-610-311224-lof torvalds#19
[  606.301109] Hardware name: Gigabyte Technology Co., Ltd. X570S GAMING X/X570S GAMING X, BIOS F7 03/22/2024
[  606.301124] Workqueue: comp_1.1.0 drm_sched_run_job_work [gpu_sched]
[  606.301140] Call Trace:
[  606.301146]  <TASK>
[  606.301154]  dump_stack_lvl+0x9b/0xf0
[  606.301166]  dump_stack+0x10/0x20
[  606.301175]  print_circular_bug+0x26c/0x340
[  606.301187]  check_noncircular+0x157/0x170
[  606.301197]  ? register_lock_class+0x48/0x490
[  606.301213]  __lock_acquire+0x16f9/0x2810
[  606.301230]  lock_acquire+0xd1/0x300
[  606.301239]  ? __flush_work+0x232/0x610
[  606.301250]  ? srso_alias_return_thunk+0x5/0xfbef5
[  606.301261]  ? mark_held_locks+0x54/0x90
[  606.301274]  ? __flush_work+0x232/0x610
[  606.301284]  __flush_work+0x250/0x610
[  606.301293]  ? __flush_work+0x232/0x610
[  606.301305]  ? __pfx_wq_barrier_func+0x10/0x10
[  606.301318]  ? mark_held_locks+0x54/0x90
[  606.301331]  ? srso_alias_return_thunk+0x5/0xfbef5
[  606.301345]  cancel_delayed_work_sync+0x71/0x80
[  606.301356]  amdgpu_gfx_kfd_sch_ctrl+0x287/0x4d0 [amdgpu]
[  606.301661]  amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu]
[  606.302050]  ? srso_alias_return_thunk+0x5/0xfbef5
[  606.302069]  amdgpu_ring_alloc+0x48/0x70 [amdgpu]
[  606.302452]  amdgpu_ib_schedule+0x176/0x8a0 [amdgpu]
[  606.302862]  ? drm_sched_entity_error+0x82/0x190 [gpu_sched]
[  606.302890]  amdgpu_job_run+0xac/0x1e0 [amdgpu]
[  606.303366]  drm_sched_run_job_work+0x24f/0x430 [gpu_sched]
[  606.303388]  process_one_work+0x21e/0x680
[  606.303409]  worker_thread+0x190/0x350
[  606.303424]  ? __pfx_worker_thread+0x10/0x10
[  606.303437]  kthread+0xe7/0x120
[  606.303449]  ? __pfx_kthread+0x10/0x10
[  606.303463]  ret_from_fork+0x3c/0x60
[  606.303476]  ? __pfx_kthread+0x10/0x10
[  606.303489]  ret_from_fork_asm+0x1a/0x30
[  606.303512]  </TASK>

v2: Refactor lock handling to resolve circular dependency (Alex)

- Introduced a `sched_work` flag to defer the call to
  `amdgpu_gfx_kfd_sch_ctrl` until after releasing
  `enforce_isolation_mutex`.
- This change ensures that `amdgpu_gfx_kfd_sch_ctrl` is called outside
  the critical section, preventing the circular dependency and deadlock.
- The `sched_work` flag is set within the mutex-protected section if
  conditions are met, and the actual function call is made afterward.
- This approach ensures consistent lock acquisition order.

Fixes: afefd6f ("drm/amdgpu: Implement Enforce Isolation Handler for KGD/KFD serialization")
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 0b6b2dd)
Cc: stable@vger.kernel.org
mdroth pushed a commit that referenced this pull request Mar 19, 2025
syz reports an out of bounds read:

==================================================================
BUG: KASAN: slab-out-of-bounds in ocfs2_match fs/ocfs2/dir.c:334
[inline]
BUG: KASAN: slab-out-of-bounds in ocfs2_search_dirblock+0x283/0x6e0
fs/ocfs2/dir.c:367
Read of size 1 at addr ffff88804d8b9982 by task syz-executor.2/14802

CPU: 0 UID: 0 PID: 14802 Comm: syz-executor.2 Not tainted 6.13.0-rc4 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1
04/01/2014
Sched_ext: serialise (enabled+all), task: runnable_at=-10ms
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x229/0x350 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0x164/0x530 mm/kasan/report.c:489
kasan_report+0x147/0x180 mm/kasan/report.c:602
ocfs2_match fs/ocfs2/dir.c:334 [inline]
ocfs2_search_dirblock+0x283/0x6e0 fs/ocfs2/dir.c:367
ocfs2_find_entry_id fs/ocfs2/dir.c:414 [inline]
ocfs2_find_entry+0x1143/0x2db0 fs/ocfs2/dir.c:1078
ocfs2_find_files_on_disk+0x18e/0x530 fs/ocfs2/dir.c:1981
ocfs2_lookup_ino_from_name+0xb6/0x110 fs/ocfs2/dir.c:2003
ocfs2_lookup+0x30a/0xd40 fs/ocfs2/namei.c:122
lookup_open fs/namei.c:3627 [inline]
open_last_lookups fs/namei.c:3748 [inline]
path_openat+0x145a/0x3870 fs/namei.c:3984
do_filp_open+0xe9/0x1c0 fs/namei.c:4014
do_sys_openat2+0x135/0x1d0 fs/open.c:1402
do_sys_open fs/open.c:1417 [inline]
__do_sys_openat fs/open.c:1433 [inline]
__se_sys_openat fs/open.c:1428 [inline]
__x64_sys_openat+0x15d/0x1c0 fs/open.c:1428
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf6/0x210 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f01076903ad
Code: c3 e8 a7 2b 00 00 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89
f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f01084acfc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 00007f01077cbf80 RCX: 00007f01076903ad
RDX: 0000000000105042 RSI: 0000000020000080 RDI: ffffffffffffff9c
RBP: 00007f01077cbf80 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000000001ff R11: 0000000000000246 R12: 0000000000000000
R13: 00007f01077cbf80 R14: 00007f010764fc90 R15: 00007f010848d000
</TASK>
==================================================================

And a general protection fault in ocfs2_prepare_dir_for_insert:

==================================================================
loop0: detected capacity change from 0 to 32768
JBD2: Ignoring recovery information on journal
ocfs2: Mounting device (7,0) on (node local, slot 0) with ordered data
mode.
Oops: general protection fault, probably for non-canonical address
0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
CPU: 0 UID: 0 PID: 5096 Comm: syz-executor792 Not tainted
6.11.0-rc4-syzkaller-00002-gb0da640826ba #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:ocfs2_find_dir_space_id fs/ocfs2/dir.c:3406 [inline]
RIP: 0010:ocfs2_prepare_dir_for_insert+0x3309/0x5c70 fs/ocfs2/dir.c:4280
Code: 00 00 e8 2a 25 13 fe e9 ba 06 00 00 e8 20 25 13 fe e9 4f 01 00 00
e8 16 25 13 fe 49 8d 7f 08 49 8d 5f 09 48 89 f8 48 c1 e8 03 <42> 0f b6
04 20 84 c0 0f 85 bd 23 00 00 48 89 d8 48 c1 e8 03 42 0f
RSP: 0018:ffffc9000af9f020 EFLAGS: 00010202
RAX: 0000000000000001 RBX: 0000000000000009 RCX: ffff88801e27a440
RDX: 0000000000000000 RSI: 0000000000000400 RDI: 0000000000000008
RBP: ffffc9000af9f830 R08: ffffffff8380395b R09: ffffffff838090a7
R10: 0000000000000002 R11: ffff88801e27a440 R12: dffffc0000000000
R13: ffff88803c660878 R14: f700000000000088 R15: 0000000000000000
FS:  000055555a677380(0000) GS:ffff888020800000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000560bce569178 CR3: 000000001de5a000 CR4: 0000000000350ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
ocfs2_mknod+0xcaf/0x2b40 fs/ocfs2/namei.c:292
vfs_mknod+0x36d/0x3b0 fs/namei.c:4088
do_mknodat+0x3ec/0x5b0
__do_sys_mknodat fs/namei.c:4166 [inline]
__se_sys_mknodat fs/namei.c:4163 [inline]
__x64_sys_mknodat+0xa7/0xc0 fs/namei.c:4163
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f2dafda3a99
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 17 00 00 90 48 89
f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08
0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8
64 89 01 48
RSP: 002b:00007ffe336a6658 EFLAGS: 00000246 ORIG_RAX:
0000000000000103
RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
00007f2dafda3a99
RDX: 00000000000021c0 RSI: 0000000020000040 RDI:
00000000ffffff9c
RBP: 00007f2dafe1b5f0 R08: 0000000000004480 R09:
000055555a6784c0
R10: 0000000000000103 R11: 0000000000000246 R12:
00007ffe336a6680
R13: 00007ffe336a68a8 R14: 431bde82d7b634db R15:
00007f2dafdec03b
</TASK>
==================================================================

The two reports are all caused invalid negative i_size of dir inode.  For
ocfs2, dir_inode can't be negative or zero.

Here add a check in which is called by ocfs2_check_dir_for_entry().  It
fixes the second report as ocfs2_check_dir_for_entry() must be called
before ocfs2_prepare_dir_for_insert().  Also set a up limit for dir with
OCFS2_INLINE_DATA_FL.  The i_size can't be great than blocksize.

Link: https://lkml.kernel.org/r/20250106140640.92260-1-glass.su@suse.com
Reported-by: Jiacheng Xu <stitch@zju.edu.cn>
Link: https://lore.kernel.org/ocfs2-devel/17a04f01.1ae74.19436d003fc.Coremail.stitch@zju.edu.cn/T/#u
Reported-by: syzbot+5a64828fcc4c2ad9b04f@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/0000000000005894f3062018caf1@google.com/T/
Signed-off-by: Su Yue <glass.su@suse.com>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mdroth pushed a commit that referenced this pull request Mar 19, 2025
Fix a lockdep warning [1] observed during the write combining test.

The warning indicates a potential nested lock scenario that could lead
to a deadlock.

However, this is a false positive alarm because the SF lock and its
parent lock are distinct ones.

The lockdep confusion arises because the locks belong to the same object
class (i.e., struct mlx5_core_dev).

To resolve this, the code has been refactored to avoid taking both
locks. Instead, only the parent lock is acquired.

[1]
raw_ethernet_bw/2118 is trying to acquire lock:
[  213.619032] ffff88811dd75e08 (&dev->wc_state_lock){+.+.}-{3:3}, at:
               mlx5_wc_support_get+0x18c/0x210 [mlx5_core]
[  213.620270]
[  213.620270] but task is already holding lock:
[  213.620943] ffff88810b585e08 (&dev->wc_state_lock){+.+.}-{3:3}, at:
               mlx5_wc_support_get+0x10c/0x210 [mlx5_core]
[  213.622045]
[  213.622045] other info that might help us debug this:
[  213.622778]  Possible unsafe locking scenario:
[  213.622778]
[  213.623465]        CPU0
[  213.623815]        ----
[  213.624148]   lock(&dev->wc_state_lock);
[  213.624615]   lock(&dev->wc_state_lock);
[  213.625071]
[  213.625071]  *** DEADLOCK ***
[  213.625071]
[  213.625805]  May be due to missing lock nesting notation
[  213.625805]
[  213.626522] 4 locks held by raw_ethernet_bw/2118:
[  213.627019]  #0: ffff88813f80d578 (&uverbs_dev->disassociate_srcu){.+.+}-{0:0},
                at: ib_uverbs_ioctl+0xc4/0x170 [ib_uverbs]
[  213.628088]  #1: ffff88810fb23930 (&file->hw_destroy_rwsem){.+.+}-{3:3},
                at: ib_init_ucontext+0x2d/0xf0 [ib_uverbs]
[  213.629094]  #2: ffff88810fb23878 (&file->ucontext_lock){+.+.}-{3:3},
                at: ib_init_ucontext+0x49/0xf0 [ib_uverbs]
[  213.630106]  #3: ffff88810b585e08 (&dev->wc_state_lock){+.+.}-{3:3},
                at: mlx5_wc_support_get+0x10c/0x210 [mlx5_core]
[  213.631185]
[  213.631185] stack backtrace:
[  213.631718] CPU: 1 UID: 0 PID: 2118 Comm: raw_ethernet_bw Not tainted
               6.12.0-rc7_internal_net_next_mlx5_89a0ad0 #1
[  213.632722] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
               rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[  213.633785] Call Trace:
[  213.634099]
[  213.634393]  dump_stack_lvl+0x7e/0xc0
[  213.634806]  print_deadlock_bug+0x278/0x3c0
[  213.635265]  __lock_acquire+0x15f4/0x2c40
[  213.635712]  lock_acquire+0xcd/0x2d0
[  213.636120]  ? mlx5_wc_support_get+0x18c/0x210 [mlx5_core]
[  213.636722]  ? mlx5_ib_enable_lb+0x24/0xa0 [mlx5_ib]
[  213.637277]  __mutex_lock+0x81/0xda0
[  213.637697]  ? mlx5_wc_support_get+0x18c/0x210 [mlx5_core]
[  213.638305]  ? mlx5_wc_support_get+0x18c/0x210 [mlx5_core]
[  213.638902]  ? rcu_read_lock_sched_held+0x3f/0x70
[  213.639400]  ? mlx5_wc_support_get+0x18c/0x210 [mlx5_core]
[  213.640016]  mlx5_wc_support_get+0x18c/0x210 [mlx5_core]
[  213.640615]  set_ucontext_resp+0x68/0x2b0 [mlx5_ib]
[  213.641144]  ? debug_mutex_init+0x33/0x40
[  213.641586]  mlx5_ib_alloc_ucontext+0x18e/0x7b0 [mlx5_ib]
[  213.642145]  ib_init_ucontext+0xa0/0xf0 [ib_uverbs]
[  213.642679]  ib_uverbs_handler_UVERBS_METHOD_GET_CONTEXT+0x95/0xc0
                [ib_uverbs]
[  213.643426]  ? _copy_from_user+0x46/0x80
[  213.643878]  ib_uverbs_cmd_verbs+0xa6b/0xc80 [ib_uverbs]
[  213.644426]  ? ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x130/0x130
               [ib_uverbs]
[  213.645213]  ? __lock_acquire+0xa99/0x2c40
[  213.645675]  ? lock_acquire+0xcd/0x2d0
[  213.646101]  ? ib_uverbs_ioctl+0xc4/0x170 [ib_uverbs]
[  213.646625]  ? reacquire_held_locks+0xcf/0x1f0
[  213.647102]  ? do_user_addr_fault+0x45d/0x770
[  213.647586]  ib_uverbs_ioctl+0xe0/0x170 [ib_uverbs]
[  213.648102]  ? ib_uverbs_ioctl+0xc4/0x170 [ib_uverbs]
[  213.648632]  __x64_sys_ioctl+0x4d3/0xaa0
[  213.649060]  ? do_user_addr_fault+0x4a8/0x770
[  213.649528]  do_syscall_64+0x6d/0x140
[  213.649947]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
[  213.650478] RIP: 0033:0x7fa179b0737b
[  213.650893] Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c
               89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8
               10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d
               7d 2a 0f 00 f7 d8 64 89 01 48
[  213.652619] RSP: 002b:00007ffd2e6d46e8 EFLAGS: 00000246 ORIG_RAX:
               0000000000000010
[  213.653390] RAX: ffffffffffffffda RBX: 00007ffd2e6d47f8 RCX:
               00007fa179b0737b
[  213.654084] RDX: 00007ffd2e6d47e0 RSI: 00000000c0181b01 RDI:
               0000000000000003
[  213.654767] RBP: 00007ffd2e6d47c0 R08: 00007fa1799be010 R09:
               0000000000000002
[  213.655453] R10: 00007ffd2e6d4960 R11: 0000000000000246 R12:
               00007ffd2e6d487c
[  213.656170] R13: 0000000000000027 R14: 0000000000000001 R15:
               00007ffd2e6d4f70

Fixes: d98995b ("net/mlx5: Reimplement write combining test")
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
mdroth pushed a commit that referenced this pull request Mar 19, 2025
Clear the port select structure on error so no stale values left after
definers are destroyed. That's because the mlx5_lag_destroy_definers()
always try to destroy all lag definers in the tt_map, so in the flow
below lag definers get double-destroyed and cause kernel crash:

  mlx5_lag_port_sel_create()
    mlx5_lag_create_definers()
      mlx5_lag_create_definer()     <- Failed on tt 1
        mlx5_lag_destroy_definers() <- definers[tt=0] gets destroyed
  mlx5_lag_port_sel_create()
    mlx5_lag_create_definers()
      mlx5_lag_create_definer()     <- Failed on tt 0
        mlx5_lag_destroy_definers() <- definers[tt=0] gets double-destroyed

 Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
 Mem abort info:
   ESR = 0x0000000096000005
   EC = 0x25: DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
   FSC = 0x05: level 1 translation fault
 Data abort info:
   ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
 user pgtable: 64k pages, 48-bit VAs, pgdp=0000000112ce2e00
 [0000000000000008] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
 Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
 Modules linked in: iptable_raw bonding ip_gre ip6_gre gre ip6_tunnel tunnel6 geneve ip6_udp_tunnel udp_tunnel ipip tunnel4 ip_tunnel rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_fwctl(OE) fwctl(OE) mlx5_core(OE) mlxdevm(OE) ib_core(OE) mlxfw(OE) memtrack(OE) mlx_compat(OE) openvswitch nsh nf_conncount psample xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc netconsole overlay efi_pstore sch_fq_codel zram ip_tables crct10dif_ce qemu_fw_cfg fuse ipv6 crc_ccitt [last unloaded: mlx_compat(OE)]
  CPU: 3 UID: 0 PID: 217 Comm: kworker/u53:2 Tainted: G           OE      6.11.0+ #2
  Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
  Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
  Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core]
  pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  pc : mlx5_del_flow_rules+0x24/0x2c0 [mlx5_core]
  lr : mlx5_lag_destroy_definer+0x54/0x100 [mlx5_core]
  sp : ffff800085fafb00
  x29: ffff800085fafb00 x28: ffff0000da0c8000 x27: 0000000000000000
  x26: ffff0000da0c8000 x25: ffff0000da0c8000 x24: ffff0000da0c8000
  x23: ffff0000c31f81a0 x22: 0400000000000000 x21: ffff0000da0c8000
  x20: 0000000000000000 x19: 0000000000000001 x18: 0000000000000000
  x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffff8b0c9350
  x14: 0000000000000000 x13: ffff800081390d18 x12: ffff800081dc3cc0
  x11: 0000000000000001 x10: 0000000000000b10 x9 : ffff80007ab7304c
  x8 : ffff0000d00711f0 x7 : 0000000000000004 x6 : 0000000000000190
  x5 : ffff00027edb3010 x4 : 0000000000000000 x3 : 0000000000000000
  x2 : ffff0000d39b8000 x1 : ffff0000d39b8000 x0 : 0400000000000000
  Call trace:
   mlx5_del_flow_rules+0x24/0x2c0 [mlx5_core]
   mlx5_lag_destroy_definer+0x54/0x100 [mlx5_core]
   mlx5_lag_destroy_definers+0xa0/0x108 [mlx5_core]
   mlx5_lag_port_sel_create+0x2d4/0x6f8 [mlx5_core]
   mlx5_activate_lag+0x60c/0x6f8 [mlx5_core]
   mlx5_do_bond_work+0x284/0x5c8 [mlx5_core]
   process_one_work+0x170/0x3e0
   worker_thread+0x2d8/0x3e0
   kthread+0x11c/0x128
   ret_from_fork+0x10/0x20
  Code: a9025bf5 aa0003f6 a90363f7 f90023f9 (f9400400)
  ---[ end trace 0000000000000000 ]---

Fixes: dc48516 ("net/mlx5: Lag, add support to create definers for LAG")
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
mdroth pushed a commit that referenced this pull request Oct 10, 2025
When I run the NVME over TCP test in virtme-ng, I get the following
"suspicious RCU usage" warning in nvme_mpath_add_sysfs_link():

'''
[    5.024557][   T44] nvmet: Created nvm controller 1 for subsystem nqn.2025-06.org.nvmexpress.mptcp for NQN nqn.2014-08.org.nvmexpress:uuid:f7f6b5e0-ff97-4894-98ac-c85309e0bc77.
[    5.027401][  T183] nvme nvme0: creating 2 I/O queues.
[    5.029017][  T183] nvme nvme0: mapped 2/0/0 default/read/poll queues.
[    5.032587][  T183] nvme nvme0: new ctrl: NQN "nqn.2025-06.org.nvmexpress.mptcp", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:f7f6b5e0-ff97-4894-98ac-c85309e0bc77
[    5.042214][   T25]
[    5.042440][   T25] =============================
[    5.042579][   T25] WARNING: suspicious RCU usage
[    5.042705][   T25] 6.16.0-rc3+ torvalds#23 Not tainted
[    5.042812][   T25] -----------------------------
[    5.042934][   T25] drivers/nvme/host/multipath.c:1203 RCU-list traversed in non-reader section!!
[    5.043111][   T25]
[    5.043111][   T25] other info that might help us debug this:
[    5.043111][   T25]
[    5.043341][   T25]
[    5.043341][   T25] rcu_scheduler_active = 2, debug_locks = 1
[    5.043502][   T25] 3 locks held by kworker/u9:0/25:
[    5.043615][   T25]  #0: ffff888008730948 ((wq_completion)async){+.+.}-{0:0}, at: process_one_work+0x7ed/0x1350
[    5.043830][   T25]  #1: ffffc900001afd40 ((work_completion)(&entry->work)){+.+.}-{0:0}, at: process_one_work+0xcf3/0x1350
[    5.044084][   T25]  #2: ffff888013ee0020 (&head->srcu){.+.+}-{0:0}, at: nvme_mpath_add_sysfs_link.part.0+0xb4/0x3a0
[    5.044300][   T25]
[    5.044300][   T25] stack backtrace:
[    5.044439][   T25] CPU: 0 UID: 0 PID: 25 Comm: kworker/u9:0 Not tainted 6.16.0-rc3+ torvalds#23 PREEMPT(full)
[    5.044441][   T25] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[    5.044442][   T25] Workqueue: async async_run_entry_fn
[    5.044445][   T25] Call Trace:
[    5.044446][   T25]  <TASK>
[    5.044449][   T25]  dump_stack_lvl+0x6f/0xb0
[    5.044453][   T25]  lockdep_rcu_suspicious.cold+0x4f/0xb1
[    5.044457][   T25]  nvme_mpath_add_sysfs_link.part.0+0x2fb/0x3a0
[    5.044459][   T25]  ? queue_work_on+0x90/0xf0
[    5.044461][   T25]  ? lockdep_hardirqs_on+0x78/0x110
[    5.044466][   T25]  nvme_mpath_set_live+0x1e9/0x4f0
[    5.044470][   T25]  nvme_mpath_add_disk+0x240/0x2f0
[    5.044472][   T25]  ? __pfx_nvme_mpath_add_disk+0x10/0x10
[    5.044475][   T25]  ? add_disk_fwnode+0x361/0x580
[    5.044480][   T25]  nvme_alloc_ns+0x81c/0x17c0
[    5.044483][   T25]  ? kasan_quarantine_put+0x104/0x240
[    5.044487][   T25]  ? __pfx_nvme_alloc_ns+0x10/0x10
[    5.044495][   T25]  ? __pfx_nvme_find_get_ns+0x10/0x10
[    5.044496][   T25]  ? rcu_read_lock_any_held+0x45/0xa0
[    5.044498][   T25]  ? validate_chain+0x232/0x4f0
[    5.044503][   T25]  nvme_scan_ns+0x4c8/0x810
[    5.044506][   T25]  ? __pfx_nvme_scan_ns+0x10/0x10
[    5.044508][   T25]  ? find_held_lock+0x2b/0x80
[    5.044512][   T25]  ? ktime_get+0x16d/0x220
[    5.044517][   T25]  ? kvm_clock_get_cycles+0x18/0x30
[    5.044520][   T25]  ? __pfx_nvme_scan_ns_async+0x10/0x10
[    5.044522][   T25]  async_run_entry_fn+0x97/0x560
[    5.044523][   T25]  ? rcu_is_watching+0x12/0xc0
[    5.044526][   T25]  process_one_work+0xd3c/0x1350
[    5.044532][   T25]  ? __pfx_process_one_work+0x10/0x10
[    5.044536][   T25]  ? assign_work+0x16c/0x240
[    5.044539][   T25]  worker_thread+0x4da/0xd50
[    5.044545][   T25]  ? __pfx_worker_thread+0x10/0x10
[    5.044546][   T25]  kthread+0x356/0x5c0
[    5.044548][   T25]  ? __pfx_kthread+0x10/0x10
[    5.044549][   T25]  ? ret_from_fork+0x1b/0x2e0
[    5.044552][   T25]  ? __lock_release.isra.0+0x5d/0x180
[    5.044553][   T25]  ? ret_from_fork+0x1b/0x2e0
[    5.044555][   T25]  ? rcu_is_watching+0x12/0xc0
[    5.044557][   T25]  ? __pfx_kthread+0x10/0x10
[    5.044559][   T25]  ret_from_fork+0x218/0x2e0
[    5.044561][   T25]  ? __pfx_kthread+0x10/0x10
[    5.044562][   T25]  ret_from_fork_asm+0x1a/0x30
[    5.044570][   T25]  </TASK>
'''

This patch uses sleepable RCU version of helper list_for_each_entry_srcu()
instead of list_for_each_entry_rcu() to fix it.

Fixes: 4dbd2b2 ("nvme-multipath: Add visibility for round-robin io-policy")
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
mdroth pushed a commit that referenced this pull request Oct 10, 2025
With VIRTCHNL2_CAP_MACFILTER enabled, the following warning is generated
on module load:

[  324.701677] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:578
[  324.701684] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1582, name: NetworkManager
[  324.701689] preempt_count: 201, expected: 0
[  324.701693] RCU nest depth: 0, expected: 0
[  324.701697] 2 locks held by NetworkManager/1582:
[  324.701702]  #0: ffffffff9f7be770 (rtnl_mutex){....}-{3:3}, at: rtnl_newlink+0x791/0x21e0
[  324.701730]  #1: ff1100216c380368 (_xmit_ETHER){....}-{2:2}, at: __dev_open+0x3f0/0x870
[  324.701749] Preemption disabled at:
[  324.701752] [<ffffffff9cd23b9d>] __dev_open+0x3dd/0x870
[  324.701765] CPU: 30 UID: 0 PID: 1582 Comm: NetworkManager Not tainted 6.15.0-rc5+ #2 PREEMPT(voluntary)
[  324.701771] Hardware name: Intel Corporation M50FCP2SBSTD/M50FCP2SBSTD, BIOS SE5C741.86B.01.01.0001.2211140926 11/14/2022
[  324.701774] Call Trace:
[  324.701777]  <TASK>
[  324.701779]  dump_stack_lvl+0x5d/0x80
[  324.701788]  ? __dev_open+0x3dd/0x870
[  324.701793]  __might_resched.cold+0x1ef/0x23d
<..>
[  324.701818]  __mutex_lock+0x113/0x1b80
<..>
[  324.701917]  idpf_ctlq_clean_sq+0xad/0x4b0 [idpf]
[  324.701935]  ? kasan_save_track+0x14/0x30
[  324.701941]  idpf_mb_clean+0x143/0x380 [idpf]
<..>
[  324.701991]  idpf_send_mb_msg+0x111/0x720 [idpf]
[  324.702009]  idpf_vc_xn_exec+0x4cc/0x990 [idpf]
[  324.702021]  ? rcu_is_watching+0x12/0xc0
[  324.702035]  idpf_add_del_mac_filters+0x3ed/0xb50 [idpf]
<..>
[  324.702122]  __hw_addr_sync_dev+0x1cf/0x300
[  324.702126]  ? find_held_lock+0x32/0x90
[  324.702134]  idpf_set_rx_mode+0x317/0x390 [idpf]
[  324.702152]  __dev_open+0x3f8/0x870
[  324.702159]  ? __pfx___dev_open+0x10/0x10
[  324.702174]  __dev_change_flags+0x443/0x650
<..>
[  324.702208]  netif_change_flags+0x80/0x160
[  324.702218]  do_setlink.isra.0+0x16a0/0x3960
<..>
[  324.702349]  rtnl_newlink+0x12fd/0x21e0

The sequence is as follows:
	rtnl_newlink()->
	__dev_change_flags()->
	__dev_open()->
	dev_set_rx_mode() - >  # disables BH and grabs "dev->addr_list_lock"
	idpf_set_rx_mode() ->  # proceed only if VIRTCHNL2_CAP_MACFILTER is ON
	__dev_uc_sync() ->
	idpf_add_mac_filter ->
	idpf_add_del_mac_filters ->
	idpf_send_mb_msg() ->
	idpf_mb_clean() ->
	idpf_ctlq_clean_sq()   # mutex_lock(cq_lock)

Fix by converting cq_lock to a spinlock. All operations under the new
lock are safe except freeing the DMA memory, which may use vunmap(). Fix
by requesting a contiguous physical memory for the DMA mapping.

Fixes: a251eee ("idpf: add SRIOV support and other ndo_ops")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
mdroth pushed a commit that referenced this pull request Oct 10, 2025
If "try_verify_in_tasklet" is set for dm-verity, DM_BUFIO_CLIENT_NO_SLEEP
is enabled for dm-bufio. However, when bufio tries to evict buffers, there
is a chance to trigger scheduling in spin_lock_bh, the following warning
is hit:

BUG: sleeping function called from invalid context at drivers/md/dm-bufio.c:2745
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 123, name: kworker/2:2
preempt_count: 201, expected: 0
RCU nest depth: 0, expected: 0
4 locks held by kworker/2:2/123:
 #0: ffff88800a2d1548 ((wq_completion)dm_bufio_cache){....}-{0:0}, at: process_one_work+0xe46/0x1970
 #1: ffffc90000d97d20 ((work_completion)(&dm_bufio_replacement_work)){....}-{0:0}, at: process_one_work+0x763/0x1970
 #2: ffffffff8555b528 (dm_bufio_clients_lock){....}-{3:3}, at: do_global_cleanup+0x1ce/0x710
 #3: ffff88801d5820b8 (&c->spinlock){....}-{2:2}, at: do_global_cleanup+0x2a5/0x710
Preemption disabled at:
[<0000000000000000>] 0x0
CPU: 2 UID: 0 PID: 123 Comm: kworker/2:2 Not tainted 6.16.0-rc3-g90548c634bd0 torvalds#305 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Workqueue: dm_bufio_cache do_global_cleanup
Call Trace:
 <TASK>
 dump_stack_lvl+0x53/0x70
 __might_resched+0x360/0x4e0
 do_global_cleanup+0x2f5/0x710
 process_one_work+0x7db/0x1970
 worker_thread+0x518/0xea0
 kthread+0x359/0x690
 ret_from_fork+0xf3/0x1b0
 ret_from_fork_asm+0x1a/0x30
 </TASK>

That can be reproduced by:

  veritysetup format --data-block-size=4096 --hash-block-size=4096 /dev/vda /dev/vdb
  SIZE=$(blockdev --getsz /dev/vda)
  dmsetup create myverity -r --table "0 $SIZE verity 1 /dev/vda /dev/vdb 4096 4096 <data_blocks> 1 sha256 <root_hash> <salt> 1 try_verify_in_tasklet"
  mount /dev/dm-0 /mnt -o ro
  echo 102400 > /sys/module/dm_bufio/parameters/max_cache_size_bytes
  [read files in /mnt]

Cc: stable@vger.kernel.org	# v6.4+
Fixes: 450e8de ("dm bufio: improve concurrent IO performance")
Signed-off-by: Wang Shuai <wangshuai12@xiaomi.com>
Signed-off-by: Sheng Yong <shengyong1@xiaomi.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
mdroth pushed a commit that referenced this pull request Oct 10, 2025
This reverts commit 7796c97.

This patch broke Dragonboard 845c (sdm845). I see:

    Unexpected kernel BRK exception at EL1
    Internal error: BRK handler: 00000000f20003e8 [#1]  SMP
    pc : qcom_swrm_set_channel_map+0x7c/0x80 [soundwire_qcom]
    lr : snd_soc_dai_set_channel_map+0x34/0x78
    Call trace:
     qcom_swrm_set_channel_map+0x7c/0x80 [soundwire_qcom] (P)
     sdm845_dai_init+0x18c/0x2e0 [snd_soc_sdm845]
     snd_soc_link_init+0x28/0x6c
     snd_soc_bind_card+0x5f4/0xb0c
     snd_soc_register_card+0x148/0x1a4
     devm_snd_soc_register_card+0x50/0xb0
     sdm845_snd_platform_probe+0x124/0x148 [snd_soc_sdm845]
     platform_probe+0x6c/0xd0
     really_probe+0xc0/0x2a4
     __driver_probe_device+0x7c/0x130
     driver_probe_device+0x40/0x118
     __device_attach_driver+0xc4/0x108
     bus_for_each_drv+0x8c/0xf0
     __device_attach+0xa4/0x198
     device_initial_probe+0x18/0x28
     bus_probe_device+0xb8/0xbc
     deferred_probe_work_func+0xac/0xfc
     process_one_work+0x244/0x658
     worker_thread+0x1b4/0x360
     kthread+0x148/0x228
     ret_from_fork+0x10/0x20
    Kernel panic - not syncing: BRK handler: Fatal exception

Dan has also reported following issues with the original patch
https://lore.kernel.org/all/33fe8fe7-719a-405a-9ed2-d9f816ce1d57@sabinyo.mountain/

Bug #1:
The zeroeth element of ctrl->pconfig[] is supposed to be unused.  We
start counting at 1.  However this code sets ctrl->pconfig[0].ch_mask = 128.

Bug #2:
There are SLIM_MAX_TX_PORTS (16) elements in tx_ch[] array but only
QCOM_SDW_MAX_PORTS + 1 (15) in the ctrl->pconfig[] array so it corrupts
memory like Yongqin Liu pointed out.

Bug 3:
Like Jie Gan pointed out, it erases all the tx information with the rx
information.

Cc: stable@vger.kernel.org # v6.15+
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Acked-by: Srinivas Kandagatla <srini@kernel.org>
Link: https://lore.kernel.org/r/20250709174949.8541-1-amit.pundir@linaro.org
Signed-off-by: Vinod Koul <vkoul@kernel.org>
mdroth pushed a commit that referenced this pull request Oct 10, 2025
 into HEAD

KVM/riscv fixes for 6.16, take #2

- Disable vstimecmp before exiting to user-space
- Move HGEI[E|P] CSR access to IMSIC virtualization
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant