CPU frequency scaling issue. #12

Open
swejuggalo opened this issue Nov 30, 2020 · 2 comments

@swejuggalo

I've looked around and haven't found a good answer or explanation for this.
It seems to affect firmware from beta 2 onward (in my case, stock firmware on an 8 Pro).
I have a frequency monitor app running constantly, hoping to find a pattern in why and when it happens.

In short, the CPU refuses to use any frequency lower than 1056 MHz. It is still allowed to enter deep sleep, but the lower frequency ranges are locked out on all CPU cores. This happens randomly, anywhere from within an hour to 1 or 2 days of uptime.

I've seen a few commits to the CPU scheduler. Some of them seem to touch this area, but I don't know whether they could be causing this behavior.
Could any recent changes automatically adjust the allowed ranges?
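
One way to check the symptom from a script is to compare each cpufreq policy's current floor with the hardware minimum. This is a minimal sketch, assuming the standard Linux cpufreq sysfs layout (`/sys/devices/system/cpu/cpufreq/policy*/scaling_min_freq` and `cpuinfo_min_freq`, both in kHz). The function name `find_raised_floors` is hypothetical, and reading these files on a stock phone may require root.

```python
from pathlib import Path

# Standard cpufreq sysfs location; assumed to exist on the device being checked.
CPUFREQ_ROOT = Path("/sys/devices/system/cpu/cpufreq")


def read_khz(path):
    """Return the integer kHz value stored in a cpufreq sysfs file."""
    return int(Path(path).read_text().strip())


def find_raised_floors(root=CPUFREQ_ROOT):
    """Return {policy: (scaling_min_khz, hw_min_khz)} for every policy whose
    current floor is above the hardware minimum."""
    raised = {}
    for policy in sorted(Path(root).glob("policy*")):
        floor = read_khz(policy / "scaling_min_freq")
        hw_min = read_khz(policy / "cpuinfo_min_freq")
        if floor > hw_min:
            raised[policy.name] = (floor, hw_min)
    return raised


if __name__ == "__main__":
    for name, (floor, hw_min) in find_raised_floors().items():
        print(f"{name}: floor {floor // 1000} MHz, hardware minimum {hw_min // 1000} MHz")
```

If the floors look normal while the cores still never drop below 1056 MHz, the limit is probably being applied somewhere other than `scaling_min_freq`, and this check alone will not find it.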

For reference.
https://forums.oneplus.com/threads/8-pro-beta-3-cpu-frequency-scaling.1340830/

@swejuggalo
Author

b619cde

097243c

And perhaps most relevant
7816143
f177186
6b3fd22

vutung2311 pushed a commit to vutung2311/android_kernel_oneplus_sm8250 that referenced this issue Feb 27, 2021
commit e3336ca upstream.

We've met softlockup with "CONFIG_PREEMPT_NONE=y", when the target memcg
doesn't have any reclaimable memory.

It can be easily reproduced as below:

  watchdog: BUG: soft lockup - CPU#0 stuck for 111s![memcg_test:2204]
  CPU: 0 PID: 2204 Comm: memcg_test Not tainted 5.9.0-rc2+ #12
  Call Trace:
    shrink_lruvec+0x49f/0x640
    shrink_node+0x2a6/0x6f0
    do_try_to_free_pages+0xe9/0x3e0
    try_to_free_mem_cgroup_pages+0xef/0x1f0
    try_charge+0x2c1/0x750
    mem_cgroup_charge+0xd7/0x240
    __add_to_page_cache_locked+0x2fd/0x370
    add_to_page_cache_lru+0x4a/0xc0
    pagecache_get_page+0x10b/0x2f0
    filemap_fault+0x661/0xad0
    ext4_filemap_fault+0x2c/0x40
    __do_fault+0x4d/0xf9
    handle_mm_fault+0x1080/0x1790

It only happens on our 1-vcpu instances, because there's no chance for
oom reaper to run to reclaim the to-be-killed process.

Add a cond_resched() at the upper shrink_node_memcgs() to solve this
issue, this will mean that we will get a scheduling point for each memcg
in the reclaimed hierarchy without any dependency on the reclaimable
memory in that memcg thus making it more predictable.

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Chris Down <chris@chrisdown.name>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: http://lkml.kernel.org/r/1598495549-67324-1-git-send-email-xlpang@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Julius Hemanth Pitti <jpitti@cisco.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
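
The effect of the added scheduling point can be illustrated with a toy cooperative scheduler. This is only a model, not kernel code: Python generators stand in for tasks on a single non-preemptive CPU, a `yield` stands in for `cond_resched()`, and the names `reclaim`, `oom_reaper` and `run_round_robin` are hypothetical.

```python
from collections import deque


def reclaim(memcgs, log):
    """Walk a memcg hierarchy, yielding once per memcg like the added cond_resched()."""
    for name, reclaimable_pages in memcgs:
        log.append(f"reclaim {name}: {reclaimable_pages} pages")
        yield  # scheduling point taken even when nothing is reclaimable


def oom_reaper(log):
    """Stand-in for the task that needs CPU time to free the victim's memory."""
    log.append("oom_reaper ran")
    yield


def run_round_robin(tasks):
    """Run generator tasks on one 'CPU', switching only at their yield points."""
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)
            queue.append(task)  # task is not finished, give it another turn later
        except StopIteration:
            pass  # task finished, drop it


log = []
memcgs = [("A", 0), ("B", 0), ("C", 0)]  # no reclaimable memory anywhere
run_round_robin([reclaim(memcgs, log), oom_reaper(log)])
print(log)
```

In this model the reaper gets its turn after the first memcg is visited. A walk with no yield point would run to completion first, which mirrors the single-vCPU lockup described in the commit message.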
vutung2311 added a commit to vutung2311/android_kernel_oneplus_sm8250 that referenced this issue Feb 27, 2021
WQ_MEM_RECLAIM cause this warning

[20210226_12:29:50.153395]@2 workqueue: WQ_MEM_RECLAIM mhi_w:mhi_pm_st_worker is flushing !WQ_MEM_RECLAIM events_highpri:flush_backlog
WARNING: CPU: 2 PID: 14326 at kernel/workqueue.c:2545 check_flush_dependency+0x118/0x120
[20210226_12:29:50.154078]@2 Modules linked in:
[20210226_12:29:50.154261]@2 CPU: 2 PID: 14326 Comm: kworker/u17:0 Tainted: GFS         O      4.19.113-perf+ #12
[20210226_12:29:50.154591]@2 Hardware name: Qualcomm Technologies, Inc. kona MTP dvt/mp 19811 14 15 52 53 (DT)
[20210226_12:29:50.154928]@2 Workqueue: mhi_w mhi_pm_st_worker
[20210226_12:29:50.155109]@2 pstate: 60c00085 (nZCv daIf +PAN +UAO)
[20210226_12:29:50.155437]@2 pc : check_flush_dependency+0x118/0x120
[20210226_12:29:50.155765]@2 lr : check_flush_dependency+0x118/0x120
[20210226_12:29:50.155939]@2 sp : ffffff803019b870
[20210226_12:29:50.156115]@2 x29: ffffff803019b870 x28: 0000000000000003
[20210226_12:29:50.156444]@2 x27: ffffff801d79f0d8 x26: ffffff801d79f098
[20210226_12:29:50.156771]@2 x25: ffffff9b26937620 x24: ffffff9b26937620
[20210226_12:29:50.156948]@2 x23: ffffffeb3fb23f00 x22: ffffffe957278000
[20210226_12:29:50.157272]@2 x21: ffffffeaece2db00 x20: ffffffe9bfc24a00
[20210226_12:29:50.157448]@2 x19: ffffff9b25646e94 x18: 0000000000000001
[20210226_12:29:50.157775]@2 x17: 0000000000000031 x16: ffffff9b258dde18
[20210226_12:29:50.157953]@2 x15: ffffff9b25fee7f7 x14: 775f74735f6d705f
[20210226_12:29:50.158131]@2 x13: 69686d3a775f6968 x12: 0000000000000000
[20210226_12:29:50.158461]@2 x11: 0000000000000000 x10: ffffffffffffffff
[20210226_12:29:50.158640]@2 x9 : 5950ec8a8f378000 x8 : 5950ec8a8f378000
[20210226_12:29:50.158969]@2 x7 : 0000000000000086 x6 : ffffff9b26d70136
[20210226_12:29:50.159148]@2 x5 : 000000000000001a x4 : 0000000000000008
[20210226_12:29:50.159475]@2 x3 : 000000000000676f x2 : fffffffffffffffe
[20210226_12:29:50.159803]@2 x1 : ffffff9b26d6d7fe x0 : 0000000000000086
[20210226_12:29:50.159980]@2 Call trace:
[20210226_12:29:50.160156]@2  check_flush_dependency+0x118/0x120
[20210226_12:29:50.160484]@2  __flush_work+0xc0/0x294
[20210226_12:29:50.160660]@2  flush_work+0x10/0x1c
[20210226_12:29:50.160986]@2  rollback_registered_many+0x208/0x61c
[20210226_12:29:50.161163]@2  unregister_netdevice_queue+0xe0/0x168
[20210226_12:29:50.161491]@2  unregister_netdev+0x20/0x30
[20210226_12:29:50.161670]@2  mhi_netdev_remove+0x1b0/0x20c
[20210226_12:29:50.162001]@2  mhi_driver_remove+0x2dc/0x3dc
[20210226_12:29:50.162183]@2  device_release_driver_internal+0x16c/0x220
[20210226_12:29:50.162505]@2  device_release_driver+0x14/0x1c
[20210226_12:29:50.162678]@2  bus_remove_device+0xd0/0xf4
[20210226_12:29:50.162998]@2  device_del+0x27c/0x4a8
[20210226_12:29:50.163175]@2  mhi_destroy_device+0x74/0xb0
[20210226_12:29:50.163495]@2  device_for_each_child+0x54/0xa4
[20210226_12:29:50.163672]@2  mhi_pm_st_worker+0x6d4/0xdd0
[20210226_12:29:50.164002]@2  process_one_work+0x1ec/0x438
[20210226_12:29:50.164181]@2  worker_thread+0x310/0x4b0
[20210226_12:29:50.164362]@2  kthread+0x114/0x124
[20210226_12:29:50.164695]@2  ret_from_fork+0x10/0x18
vutung2311 pushed a commit to vutung2311/android_kernel_oneplus_sm8250 that referenced this issue Mar 15, 2021
vutung2311 added a commit to vutung2311/android_kernel_oneplus_sm8250 that referenced this issue Mar 15, 2021
vutung2311 pushed a commit to vutung2311/android_kernel_oneplus_sm8250 that referenced this issue Jun 8, 2021
vutung2311 added a commit to vutung2311/android_kernel_oneplus_sm8250 that referenced this issue Jun 8, 2021
@swejuggalo
Author

New discovery. I'm fairly sure this applies to unmodified kernels on both the non-Pro and Pro models of the 8 and 9.

Turning off Bubbles in system settings removes the issue for me, so Bubbles is apparently a trigger for this behaviour.

The underlying problem remains, though: nothing should be able to make the CPU get stuck like this.

schqiushui pushed a commit to PixelExperiencePlus/android_kernel_oneplus_sm8250 that referenced this issue Mar 18, 2022
[ Upstream commit 4224cfd7fb6523f7a9d1c8bb91bb5df1e38eb624 ]

When bringing down the netdevice or system shutdown, a panic can be
triggered while accessing the sysfs path because the device is already
removed.

    [  755.549084] mlx5_core 0000:12:00.1: Shutdown was called
    [  756.404455] mlx5_core 0000:12:00.0: Shutdown was called
    ...
    [  757.937260] BUG: unable to handle kernel NULL pointer dereference at           (null)
    [  758.031397] IP: [<ffffffff8ee11acb>] dma_pool_alloc+0x1ab/0x280

    crash> bt
    ...
    PID: 12649  TASK: ffff8924108f2100  CPU: 1   COMMAND: "amsd"
    ...
     #9 [ffff89240e1a38b0] page_fault at ffffffff8f38c778
        [exception RIP: dma_pool_alloc+0x1ab]
        RIP: ffffffff8ee11acb  RSP: ffff89240e1a3968  RFLAGS: 00010046
        RAX: 0000000000000246  RBX: ffff89243d874100  RCX: 0000000000001000
        RDX: 0000000000000000  RSI: 0000000000000246  RDI: ffff89243d874090
        RBP: ffff89240e1a39c0   R8: 000000000001f080   R9: ffff8905ffc03c00
        R10: ffffffffc04680d4  R11: ffffffff8edde9fd  R12: 00000000000080d0
        R13: ffff89243d874090  R14: ffff89243d874080  R15: 0000000000000000
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    #10 [ffff89240e1a39c8] mlx5_alloc_cmd_msg at ffffffffc04680f3 [mlx5_core]
    #11 [ffff89240e1a3a18] cmd_exec at ffffffffc046ad62 [mlx5_core]
    #12 [ffff89240e1a3ab8] mlx5_cmd_exec at ffffffffc046b4fb [mlx5_core]
    #13 [ffff89240e1a3ae8] mlx5_core_access_reg at ffffffffc0475434 [mlx5_core]
    #14 [ffff89240e1a3b40] mlx5e_get_fec_caps at ffffffffc04a7348 [mlx5_core]
    #15 [ffff89240e1a3bb0] get_fec_supported_advertised at ffffffffc04992bf [mlx5_core]
    #16 [ffff89240e1a3c08] mlx5e_get_link_ksettings at ffffffffc049ab36 [mlx5_core]
    #17 [ffff89240e1a3ce8] __ethtool_get_link_ksettings at ffffffff8f25db46
    #18 [ffff89240e1a3d48] speed_show at ffffffff8f277208
    #19 [ffff89240e1a3dd8] dev_attr_show at ffffffff8f0b70e3
    #20 [ffff89240e1a3df8] sysfs_kf_seq_show at ffffffff8eedbedf
    #21 [ffff89240e1a3e18] kernfs_seq_show at ffffffff8eeda596
    #22 [ffff89240e1a3e28] seq_read at ffffffff8ee76d10
    #23 [ffff89240e1a3e98] kernfs_fop_read at ffffffff8eedaef5
    #24 [ffff89240e1a3ed8] vfs_read at ffffffff8ee4e3ff
    #25 [ffff89240e1a3f08] sys_read at ffffffff8ee4f27f
    #26 [ffff89240e1a3f50] system_call_fastpath at ffffffff8f395f92

    crash> net_device.state ffff89443b0c0000
      state = 0x5  (__LINK_STATE_START| __LINK_STATE_NOCARRIER)

To prevent this scenario, we also make sure that the netdevice is present.

Signed-off-by: suresh kumar <suresh2514@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
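
The `net_device.state` value in the crash output above can be decoded by hand. This is a small sketch, assuming the bit positions of `enum netdev_state_t` in `include/linux/netdevice.h` (`__LINK_STATE_START` = 0, `__LINK_STATE_PRESENT` = 1, `__LINK_STATE_NOCARRIER` = 2). The helper names are hypothetical and only mimic what `netif_running()` and `netif_device_present()` test.

```python
# Bit positions assumed from enum netdev_state_t (include/linux/netdevice.h).
LINK_STATE_BITS = {
    "__LINK_STATE_START": 0,
    "__LINK_STATE_PRESENT": 1,
    "__LINK_STATE_NOCARRIER": 2,
}


def decode_state(state):
    """Return the names of the link-state bits set in `state`."""
    return [name for name, bit in LINK_STATE_BITS.items() if (state >> bit) & 1]


def is_running(state):
    """Model of netif_running(): the START bit is set."""
    return bool((state >> LINK_STATE_BITS["__LINK_STATE_START"]) & 1)


def is_present(state):
    """Model of netif_device_present(): the PRESENT bit is set."""
    return bool((state >> LINK_STATE_BITS["__LINK_STATE_PRESENT"]) & 1)


def may_query_speed(state):
    """Combined condition described in the commit: running AND present."""
    return is_running(state) and is_present(state)


state = 0x5  # value from the crash dump
print(decode_state(state))
print(is_running(state), is_present(state), may_query_speed(state))
```

With `0x5` the device still counts as running but is no longer present, so a check on the running bit alone lets the sysfs read reach the driver, while the combined check does not.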
andreock referenced this issue in andreock/android_kernel_realme_sm8250 Aug 31, 2022
[ Upstream commit 0e435be ]

crq->msgs could be NULL if the previous reset did not complete after
freeing crq->msgs. Check for NULL before dereferencing them.

Snippet of call trace:
...
ibmvnic 30000003 env3 (unregistering): Releasing sub-CRQ
ibmvnic 30000003 env3 (unregistering): Releasing CRQ
BUG: Kernel NULL pointer dereference on read at 0x00000000
Faulting instruction address: 0xc0000000000c1a30
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: ibmvnic(E-) rpadlpar_io rpaphp xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_counter nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables xsk_diag tcp_diag udp_diag tun raw_diag inet_diag unix_diag bridge af_packet_diag netlink_diag stp llc rfkill sunrpc pseries_rng xts vmx_crypto uio_pdrv_genirq uio binfmt_misc ip_tables xfs libcrc32c sd_mod t10_pi sg ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ibmvnic]
CPU: 20 PID: 8426 Comm: kworker/20:0 Tainted: G            E     5.10.0-rc1+ #12
Workqueue: events __ibmvnic_reset [ibmvnic]
NIP:  c0000000000c1a30 LR: c008000001b00c18 CTR: 0000000000000400
REGS: c00000000d05b7a0 TRAP: 0380   Tainted: G            E      (5.10.0-rc1+)
MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44002480  XER: 20040000
CFAR: c0000000000c19ec IRQMASK: 0
GPR00: 0000000000000400 c00000000d05ba30 c008000001b17c00 0000000000000000
GPR04: 0000000000000000 0000000000000000 0000000000000000 00000000000001e2
GPR08: 000000000001f400 ffffffffffffd950 0000000000000000 c008000001b0b280
GPR12: c0000000000c19c8 c00000001ec72e00 c00000000019a778 c00000002647b440
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000006 0000000000000001 0000000000000003 0000000000000002
GPR24: 0000000000001000 c008000001b0d570 0000000000000005 c00000007ab5d550
GPR28: c00000007ab5c000 c000000032fcf848 c00000007ab5cc00 c000000032fcf800
NIP [c0000000000c1a30] memset+0x68/0x104
LR [c008000001b00c18] ibmvnic_reset_crq+0x70/0x110 [ibmvnic]
Call Trace:
[c00000000d05ba30] [0000000000000800] 0x800 (unreliable)
[c00000000d05bab0] [c008000001b0a930] do_reset.isra.40+0x224/0x634 [ibmvnic]
[c00000000d05bb80] [c008000001b08574] __ibmvnic_reset+0x17c/0x3c0 [ibmvnic]
[c00000000d05bc50] [c00000000018d9ac] process_one_work+0x2cc/0x800
[c00000000d05bd20] [c00000000018df58] worker_thread+0x78/0x520
[c00000000d05bdb0] [c00000000019a934] kthread+0x1c4/0x1d0
[c00000000d05be20] [c00000000000d5d0] ret_from_kernel_thread+0x5c/0x6c

Fixes: 032c5e8 ("Driver for IBM System i/p VNIC protocol")
Signed-off-by: Lijun Pan <ljp@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
andreock referenced this issue in andreock/android_kernel_realme_sm8250 Aug 31, 2022
[ Upstream commit c5c97cadd7ed13381cb6b4bef5c841a66938d350 ]

The ubsan reported the following error.  It was because sample's raw
data missed u32 padding at the end.  So it broke the alignment of the
array after it.

The raw data contains an u32 size prefix so the data size should have
an u32 padding after 8-byte aligned data.

27: Sample parsing  :util/synthetic-events.c:1539:4:
  runtime error: store to misaligned address 0x62100006b9bc for type
  '__u64' (aka 'unsigned long long'), which requires 8 byte alignment
0x62100006b9bc: note: pointer points here
  00 00 00 00 ff ff ff ff  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
              ^
    #0 0x561532a9fc96 in perf_event__synthesize_sample util/synthetic-events.c:1539:13
    #1 0x5615327f4a4f in do_test tests/sample-parsing.c:284:8
    #2 0x5615327f3f50 in test__sample_parsing tests/sample-parsing.c:381:9
    #3 0x56153279d3a1 in run_test tests/builtin-test.c:424:9
    #4 0x56153279c836 in test_and_print tests/builtin-test.c:454:9
    #5 0x56153279b7eb in __cmd_test tests/builtin-test.c:675:4
    #6 0x56153279abf0 in cmd_test tests/builtin-test.c:821:9
    #7 0x56153264e796 in run_builtin perf.c:312:11
    #8 0x56153264cf03 in handle_internal_command perf.c:364:8
    #9 0x56153264e47d in run_argv perf.c:408:2
    #10 0x56153264c9a9 in main perf.c:538:3
    #11 0x7f137ab6fbbc in __libc_start_main (/lib64/libc.so.6+0x38bbc)
    #12 0x561532596828 in _start ...

SUMMARY: UndefinedBehaviorSanitizer: misaligned-pointer-use
 util/synthetic-events.c:1539:4 in

Fixes: 045f8cd ("perf tests: Add a sample parsing test")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: https://lore.kernel.org/r/20210214091638.519643-1-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
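
The alignment rule can be checked with a little arithmetic. This is a minimal sketch, assuming only what the message states: a 4-byte (`u32`) size prefix precedes the raw data, and whatever follows must start on an 8-byte boundary. The function name `padded_raw_size` is hypothetical.

```python
U32 = 4    # bytes taken by the size prefix
ALIGN = 8  # alignment required by the u64 values that follow


def padded_raw_size(payload_bytes):
    """Smallest raw-data size >= payload_bytes such that
    size prefix + raw data is a multiple of 8 bytes."""
    total = U32 + payload_bytes
    padded_total = (total + ALIGN - 1) // ALIGN * ALIGN
    return padded_total - U32


for payload in (16, 20):
    size = padded_raw_size(payload)
    print(f"payload {payload} -> raw size {size} (padding {size - payload})")
```

A payload that is itself 8-byte aligned (16 bytes here) needs one extra `u32` of padding, because the 4-byte prefix has already shifted everything after it off the boundary.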
andreock referenced this issue in andreock/android_kernel_realme_sm8250 Sep 1, 2022
This patch is to fix a crash:

 #3 [ffffb6580689f898] oops_end at ffffffffa2835bc2
 #4 [ffffb6580689f8b8] no_context at ffffffffa28766e7
 #5 [ffffb6580689f920] async_page_fault at ffffffffa320135e
    [exception RIP: f2fs_is_compressed_page+34]
    RIP: ffffffffa2ba83a2  RSP: ffffb6580689f9d8  RFLAGS: 00010213
    RAX: 0000000000000001  RBX: fffffc0f50b34bc0  RCX: 0000000000002122
    RDX: 0000000000002123  RSI: 0000000000000c00  RDI: fffffc0f50b34bc0
    RBP: ffff97e815a40178   R8: 0000000000000000   R9: ffff97e83ffc9000
    R10: 0000000000032300  R11: 0000000000032380  R12: ffffb6580689fa38
    R13: fffffc0f50b34bc0  R14: ffff97e825cbd000  R15: 0000000000000c00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffffb6580689f9d8] __is_cp_guaranteed at ffffffffa2b7ea98
 #7 [ffffb6580689f9f0] f2fs_submit_page_write at ffffffffa2b81a69
 #8 [ffffb6580689fa30] f2fs_do_write_meta_page at ffffffffa2b99777
 #9 [ffffb6580689fae0] __f2fs_write_meta_page at ffffffffa2b75f1a
 #10 [ffffb6580689fb18] f2fs_sync_meta_pages at ffffffffa2b77466
 #11 [ffffb6580689fc98] do_checkpoint at ffffffffa2b78e46
 #12 [ffffb6580689fd88] f2fs_write_checkpoint at ffffffffa2b79c29
 #13 [ffffb6580689fdd0] f2fs_sync_fs at ffffffffa2b69d95
 #14 [ffffb6580689fe20] sync_filesystem at ffffffffa2ad2574
 #15 [ffffb6580689fe30] generic_shutdown_super at ffffffffa2a9b582
 #16 [ffffb6580689fe48] kill_block_super at ffffffffa2a9b6d1
 #17 [ffffb6580689fe60] kill_f2fs_super at ffffffffa2b6abe1
 #18 [ffffb6580689fea0] deactivate_locked_super at ffffffffa2a9afb6
 #19 [ffffb6580689feb8] cleanup_mnt at ffffffffa2abcad4
 #20 [ffffb6580689fee0] task_work_run at ffffffffa28bca28
 #21 [ffffb6580689ff00] exit_to_usermode_loop at ffffffffa28050b7
 #22 [ffffb6580689ff38] do_syscall_64 at ffffffffa280560e
 #23 [ffffb6580689ff50] entry_SYSCALL_64_after_hwframe at ffffffffa320008c

This occurred when umount f2fs if enable F2FS_FS_COMPRESSION
with F2FS_IO_TRACE. Fixes it by adding IS_IO_TRACED_PAGE to check
validity of pid for page_private.

Signed-off-by: Yu Changchun <yuchangchun1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
andreock referenced this issue in andreock/android_kernel_realme_sm8250 Sep 1, 2022
https://bugzilla.kernel.org/show_bug.cgi?id=208565

PID: 257    TASK: ecdd0000  CPU: 0   COMMAND: "init"
  #0 [<c0b420ec>] (__schedule) from [<c0b423c8>]
  #1 [<c0b423c8>] (schedule) from [<c0b459d4>]
  #2 [<c0b459d4>] (rwsem_down_read_failed) from [<c0b44fa0>]
  #3 [<c0b44fa0>] (down_read) from [<c044233c>]
  #4 [<c044233c>] (f2fs_truncate_blocks) from [<c0442890>]
  #5 [<c0442890>] (f2fs_truncate) from [<c044d408>]
  #6 [<c044d408>] (f2fs_evict_inode) from [<c030be18>]
  #7 [<c030be18>] (evict) from [<c030a558>]
  #8 [<c030a558>] (iput) from [<c047c600>]
  #9 [<c047c600>] (f2fs_sync_node_pages) from [<c0465414>]
 #10 [<c0465414>] (f2fs_write_checkpoint) from [<c04575f4>]
 #11 [<c04575f4>] (f2fs_sync_fs) from [<c0441918>]
 #12 [<c0441918>] (f2fs_do_sync_file) from [<c0441098>]
 #13 [<c0441098>] (f2fs_sync_file) from [<c0323fa0>]
 #14 [<c0323fa0>] (vfs_fsync_range) from [<c0324294>]
 #15 [<c0324294>] (do_fsync) from [<c0324014>]
 #16 [<c0324014>] (sys_fsync) from [<c0108bc0>]

This can be caused by flush_dirty_inode() in f2fs_sync_node_pages() where
iput() requires f2fs_lock_op() again resulting in livelock.

Reported-by: Zhiguo Niu <Zhiguo.Niu@unisoc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
schqiushui pushed a commit to PixelExperiencePlus/android_kernel_oneplus_sm8250 that referenced this issue Jan 19, 2023
…g the sock

[ Upstream commit 3cf7203ca620682165706f70a1b12b5194607dce ]

There is a race condition in vxlan that when deleting a vxlan device
during receiving packets, there is a possibility that the sock is
released after getting vxlan_sock vs from sk_user_data. Then in
later vxlan_ecn_decapsulate(), vxlan_get_sk_family() we will got
NULL pointer dereference. e.g.

   #0 [ffffa25ec6978a38] machine_kexec at ffffffff8c669757
   #1 [ffffa25ec6978a90] __crash_kexec at ffffffff8c7c0a4d
   #2 [ffffa25ec6978b58] crash_kexec at ffffffff8c7c1c48
   #3 [ffffa25ec6978b60] oops_end at ffffffff8c627f2b
   #4 [ffffa25ec6978b80] page_fault_oops at ffffffff8c678fcb
   #5 [ffffa25ec6978bd8] exc_page_fault at ffffffff8d109542
   #6 [ffffa25ec6978c00] asm_exc_page_fault at ffffffff8d200b62
      [exception RIP: vxlan_ecn_decapsulate+0x3b]
      RIP: ffffffffc1014e7b  RSP: ffffa25ec6978cb0  RFLAGS: 00010246
      RAX: 0000000000000008  RBX: ffff8aa000888000  RCX: 0000000000000000
      RDX: 000000000000000e  RSI: ffff8a9fc7ab803e  RDI: ffff8a9fd1168700
      RBP: ffff8a9fc7ab803e   R8: 0000000000700000   R9: 00000000000010ae
      R10: ffff8a9fcb748980  R11: 0000000000000000  R12: ffff8a9fd1168700
      R13: ffff8aa000888000  R14: 00000000002a0000  R15: 00000000000010ae
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
   #7 [ffffa25ec6978ce8] vxlan_rcv at ffffffffc10189cd [vxlan]
   #8 [ffffa25ec6978d90] udp_queue_rcv_one_skb at ffffffff8cfb6507
   #9 [ffffa25ec6978dc0] udp_unicast_rcv_skb at ffffffff8cfb6e45
  #10 [ffffa25ec6978dc8] __udp4_lib_rcv at ffffffff8cfb8807
  #11 [ffffa25ec6978e20] ip_protocol_deliver_rcu at ffffffff8cf76951
  #12 [ffffa25ec6978e48] ip_local_deliver at ffffffff8cf76bde
  #13 [ffffa25ec6978ea0] __netif_receive_skb_one_core at ffffffff8cecde9b
  #14 [ffffa25ec6978ec8] process_backlog at ffffffff8cece139
  #15 [ffffa25ec6978f00] __napi_poll at ffffffff8ceced1a
  #16 [ffffa25ec6978f28] net_rx_action at ffffffff8cecf1f3
  #17 [ffffa25ec6978fa0] __softirqentry_text_start at ffffffff8d4000ca
  #18 [ffffa25ec6978ff0] do_softirq at ffffffff8c6fbdc3

Reproducer: https://github.com/Mellanox/ovs-tests/blob/master/test-ovs-vxlan-remove-tunnel-during-traffic.sh

Fix this by waiting for all sk_user_data reader to finish before
releasing the sock.

Reported-by: Jianlin Shi <jishi@redhat.com>
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Fixes: 6a93cc9 ("udp-tunnel: Add a few more UDP tunnel APIs")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
sreelekshman pushed a commit to sreelekshman/android_kernel_realme_sm8250 that referenced this issue Jun 29, 2023
[ Upstream commit 05bb0167c80b8f93c6a4e0451b7da9b96db990c2 ]

ACPICA commit 770653e3ba67c30a629ca7d12e352d83c2541b1e

Before this change we see the following UBSAN stack trace in Fuchsia:

  #0    0x000021e4213b3302 in acpi_ds_init_aml_walk(struct acpi_walk_state*, union acpi_parse_object*, struct acpi_namespace_node*, u8*, u32, struct acpi_evaluate_info*, u8) ../../third_party/acpica/source/components/dispatcher/dswstate.c:682 <platform-bus-x86.so>+0x233302
  #1.2  0x000020d0f660777f in ubsan_get_stack_trace() compiler-rt/lib/ubsan/ubsan_diag.cpp:41 <libclang_rt.asan.so>+0x3d77f
  #1.1  0x000020d0f660777f in maybe_print_stack_trace() compiler-rt/lib/ubsan/ubsan_diag.cpp:51 <libclang_rt.asan.so>+0x3d77f
  #1    0x000020d0f660777f in ~scoped_report() compiler-rt/lib/ubsan/ubsan_diag.cpp:387 <libclang_rt.asan.so>+0x3d77f
  #2    0x000020d0f660b96d in handlepointer_overflow_impl() compiler-rt/lib/ubsan/ubsan_handlers.cpp:809 <libclang_rt.asan.so>+0x4196d
  #3    0x000020d0f660b50d in compiler-rt/lib/ubsan/ubsan_handlers.cpp:815 <libclang_rt.asan.so>+0x4150d
  #4    0x000021e4213b3302 in acpi_ds_init_aml_walk(struct acpi_walk_state*, union acpi_parse_object*, struct acpi_namespace_node*, u8*, u32, struct acpi_evaluate_info*, u8) ../../third_party/acpica/source/components/dispatcher/dswstate.c:682 <platform-bus-x86.so>+0x233302
  #5    0x000021e4213e2369 in acpi_ds_call_control_method(struct acpi_thread_state*, struct acpi_walk_state*, union acpi_parse_object*) ../../third_party/acpica/source/components/dispatcher/dsmethod.c:605 <platform-bus-x86.so>+0x262369
  #6    0x000021e421437fac in acpi_ps_parse_aml(struct acpi_walk_state*) ../../third_party/acpica/source/components/parser/psparse.c:550 <platform-bus-x86.so>+0x2b7fac
  #7    0x000021e4214464d2 in acpi_ps_execute_method(struct acpi_evaluate_info*) ../../third_party/acpica/source/components/parser/psxface.c:244 <platform-bus-x86.so>+0x2c64d2
  #8    0x000021e4213aa052 in acpi_ns_evaluate(struct acpi_evaluate_info*) ../../third_party/acpica/source/components/namespace/nseval.c:250 <platform-bus-x86.so>+0x22a052
  #9    0x000021e421413dd8 in acpi_ns_init_one_device(acpi_handle, u32, void*, void**) ../../third_party/acpica/source/components/namespace/nsinit.c:735 <platform-bus-x86.so>+0x293dd8
  #10   0x000021e421429e98 in acpi_ns_walk_namespace(acpi_object_type, acpi_handle, u32, u32, acpi_walk_callback, acpi_walk_callback, void*, void**) ../../third_party/acpica/source/components/namespace/nswalk.c:298 <platform-bus-x86.so>+0x2a9e98
  #11   0x000021e4214131ac in acpi_ns_initialize_devices(u32) ../../third_party/acpica/source/components/namespace/nsinit.c:268 <platform-bus-x86.so>+0x2931ac
  #12   0x000021e42147c40d in acpi_initialize_objects(u32) ../../third_party/acpica/source/components/utilities/utxfinit.c:304 <platform-bus-x86.so>+0x2fc40d
  #13   0x000021e42126d603 in acpi::acpi_impl::initialize_acpi(acpi::acpi_impl*) ../../src/devices/board/lib/acpi/acpi-impl.cc:224 <platform-bus-x86.so>+0xed603

Add a simple check that avoids incrementing a pointer by zero, but
otherwise behaves as before. Note that our findings are against ACPICA
20221020, but the same code exists on master.
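
The check described above can be sketched in plain C roughly as follows. This is only an illustration, not the ACPICA code: `struct parse_window` and `parse_window_init()` are hypothetical names, and the sketch assumes the case being guarded against is a start pointer that may be null when the length is zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the parser state; not the ACPICA structure. */
struct parse_window {
    uint8_t *start;
    uint8_t *end;
};

/*
 * Describe the range [start, start + length) without ever evaluating
 * "pointer + 0". The result is the same as the unconditional addition
 * whenever length is non-zero.
 */
static void parse_window_init(struct parse_window *w, uint8_t *start,
                              uint32_t length)
{
    assert(w != NULL);

    w->start = start;
    w->end = start;

    if (length != 0) {
        /* Only advance when there is something to advance over. */
        w->end += length;
    }
}
```

With a zero length the end pointer simply equals the start pointer, so callers that compare the two still see an empty range.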

Link: acpica/acpica@770653e3
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
schqiushui pushed a commit to PixelExperiencePlus/android_kernel_oneplus_sm8250 that referenced this issue Nov 7, 2023
[ Upstream commit 0b0747d507bffb827e40fc0f9fb5883fffc23477 ]

The following processes run into a deadlock. CPU 41 was waiting for CPU 29
to handle a CSD request while holding spinlock "crashdump_lock", but CPU 29
was hung by that spinlock with IRQs disabled.

  PID: 17360    TASK: ffff95c1090c5c40  CPU: 41  COMMAND: "mrdiagd"
  !# 0 [ffffb80edbf37b58] __read_once_size at ffffffff9b871a40 include/linux/compiler.h:185:0
  !# 1 [ffffb80edbf37b58] atomic_read at ffffffff9b871a40 arch/x86/include/asm/atomic.h:27:0
  !# 2 [ffffb80edbf37b58] dump_stack at ffffffff9b871a40 lib/dump_stack.c:54:0
   # 3 [ffffb80edbf37b78] csd_lock_wait_toolong at ffffffff9b131ad5 kernel/smp.c:364:0
   # 4 [ffffb80edbf37b78] __csd_lock_wait at ffffffff9b131ad5 kernel/smp.c:384:0
   # 5 [ffffb80edbf37bf8] csd_lock_wait at ffffffff9b13267a kernel/smp.c:394:0
   # 6 [ffffb80edbf37bf8] smp_call_function_many at ffffffff9b13267a kernel/smp.c:843:0
   # 7 [ffffb80edbf37c50] smp_call_function at ffffffff9b13279d kernel/smp.c:867:0
   # 8 [ffffb80edbf37c50] on_each_cpu at ffffffff9b13279d kernel/smp.c:976:0
   # 9 [ffffb80edbf37c78] flush_tlb_kernel_range at ffffffff9b085c4b arch/x86/mm/tlb.c:742:0
   #10 [ffffb80edbf37cb8] __purge_vmap_area_lazy at ffffffff9b23a1e0 mm/vmalloc.c:701:0
   #11 [ffffb80edbf37ce0] try_purge_vmap_area_lazy at ffffffff9b23a2cc mm/vmalloc.c:722:0
   #12 [ffffb80edbf37ce0] free_vmap_area_noflush at ffffffff9b23a2cc mm/vmalloc.c:754:0
   #13 [ffffb80edbf37cf8] free_unmap_vmap_area at ffffffff9b23bb3b mm/vmalloc.c:764:0
   #14 [ffffb80edbf37cf8] remove_vm_area at ffffffff9b23bb3b mm/vmalloc.c:1509:0
   #15 [ffffb80edbf37d18] __vunmap at ffffffff9b23bb8a mm/vmalloc.c:1537:0
   #16 [ffffb80edbf37d40] vfree at ffffffff9b23bc85 mm/vmalloc.c:1612:0
   #17 [ffffb80edbf37d58] megasas_free_host_crash_buffer [megaraid_sas] at ffffffffc020b7f2 drivers/scsi/megaraid/megaraid_sas_fusion.c:3932:0
   #18 [ffffb80edbf37d80] fw_crash_state_store [megaraid_sas] at ffffffffc01f804d drivers/scsi/megaraid/megaraid_sas_base.c:3291:0
   #19 [ffffb80edbf37dc0] dev_attr_store at ffffffff9b56dd7b drivers/base/core.c:758:0
   #20 [ffffb80edbf37dd0] sysfs_kf_write at ffffffff9b326acf fs/sysfs/file.c:144:0
   #21 [ffffb80edbf37de0] kernfs_fop_write at ffffffff9b325fd4 fs/kernfs/file.c:316:0
   #22 [ffffb80edbf37e20] __vfs_write at ffffffff9b29418a fs/read_write.c:480:0
   #23 [ffffb80edbf37ea8] vfs_write at ffffffff9b294462 fs/read_write.c:544:0
   #24 [ffffb80edbf37ee8] SYSC_write at ffffffff9b2946ec fs/read_write.c:590:0
   #25 [ffffb80edbf37ee8] SyS_write at ffffffff9b2946ec fs/read_write.c:582:0
   #26 [ffffb80edbf37f30] do_syscall_64 at ffffffff9b003ca9 arch/x86/entry/common.c:298:0
   #27 [ffffb80edbf37f58] entry_SYSCALL_64 at ffffffff9ba001b1 arch/x86/entry/entry_64.S:238:0

  PID: 17355    TASK: ffff95c1090c3d80  CPU: 29  COMMAND: "mrdiagd"
  !# 0 [ffffb80f2d3c7d30] __read_once_size at ffffffff9b0f2ab0 include/linux/compiler.h:185:0
  !# 1 [ffffb80f2d3c7d30] native_queued_spin_lock_slowpath at ffffffff9b0f2ab0 kernel/locking/qspinlock.c:368:0
   # 2 [ffffb80f2d3c7d58] pv_queued_spin_lock_slowpath at ffffffff9b0f244b arch/x86/include/asm/paravirt.h:674:0
   # 3 [ffffb80f2d3c7d58] queued_spin_lock_slowpath at ffffffff9b0f244b arch/x86/include/asm/qspinlock.h:53:0
   # 4 [ffffb80f2d3c7d68] queued_spin_lock at ffffffff9b8961a6 include/asm-generic/qspinlock.h:90:0
   # 5 [ffffb80f2d3c7d68] do_raw_spin_lock_flags at ffffffff9b8961a6 include/linux/spinlock.h:173:0
   # 6 [ffffb80f2d3c7d68] __raw_spin_lock_irqsave at ffffffff9b8961a6 include/linux/spinlock_api_smp.h:122:0
   # 7 [ffffb80f2d3c7d68] _raw_spin_lock_irqsave at ffffffff9b8961a6 kernel/locking/spinlock.c:160:0
   # 8 [ffffb80f2d3c7d88] fw_crash_buffer_store [megaraid_sas] at ffffffffc01f8129 drivers/scsi/megaraid/megaraid_sas_base.c:3205:0
   # 9 [ffffb80f2d3c7dc0] dev_attr_store at ffffffff9b56dd7b drivers/base/core.c:758:0
   #10 [ffffb80f2d3c7dd0] sysfs_kf_write at ffffffff9b326acf fs/sysfs/file.c:144:0
   #11 [ffffb80f2d3c7de0] kernfs_fop_write at ffffffff9b325fd4 fs/kernfs/file.c:316:0
   #12 [ffffb80f2d3c7e20] __vfs_write at ffffffff9b29418a fs/read_write.c:480:0
   #13 [ffffb80f2d3c7ea8] vfs_write at ffffffff9b294462 fs/read_write.c:544:0
   #14 [ffffb80f2d3c7ee8] SYSC_write at ffffffff9b2946ec fs/read_write.c:590:0
   #15 [ffffb80f2d3c7ee8] SyS_write at ffffffff9b2946ec fs/read_write.c:582:0
   #16 [ffffb80f2d3c7f30] do_syscall_64 at ffffffff9b003ca9 arch/x86/entry/common.c:298:0
   #17 [ffffb80f2d3c7f58] entry_SYSCALL_64 at ffffffff9ba001b1 arch/x86/entry/entry_64.S:238:0

The lock is used to synchronize different sysfs operations; it doesn't
protect any resource that will be touched by an interrupt. Consequently,
it's not required to disable IRQs. Replace the spinlock with a mutex to fix
the deadlock.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Link: https://lore.kernel.org/r/20230828221018.19471-1-junxiao.bi@oracle.com
Reviewed-by: Mike Christie <michael.christie@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
schqiushui pushed a commit to PixelExperiencePlus/android_kernel_oneplus_sm8250 that referenced this issue Nov 7, 2023
[ Upstream commit a154f5f643c6ecddd44847217a7a3845b4350003 ]

The following call trace shows a deadlock issue due to recursive locking of
mutex "device_mutex". First lock acquire is in target_for_each_device() and
second in target_free_device().

 PID: 148266   TASK: ffff8be21ffb5d00  CPU: 10   COMMAND: "iscsi_ttx"
  #0 [ffffa2bfc9ec3b18] __schedule at ffffffffa8060e7f
  #1 [ffffa2bfc9ec3ba0] schedule at ffffffffa8061224
  #2 [ffffa2bfc9ec3bb8] schedule_preempt_disabled at ffffffffa80615ee
  #3 [ffffa2bfc9ec3bc8] __mutex_lock at ffffffffa8062fd7
  #4 [ffffa2bfc9ec3c40] __mutex_lock_slowpath at ffffffffa80631d3
  #5 [ffffa2bfc9ec3c50] mutex_lock at ffffffffa806320c
  #6 [ffffa2bfc9ec3c68] target_free_device at ffffffffc0935998 [target_core_mod]
  #7 [ffffa2bfc9ec3c90] target_core_dev_release at ffffffffc092f975 [target_core_mod]
  #8 [ffffa2bfc9ec3ca0] config_item_put at ffffffffa79d250f
  #9 [ffffa2bfc9ec3cd0] config_item_put at ffffffffa79d2583
 #10 [ffffa2bfc9ec3ce0] target_devices_idr_iter at ffffffffc0933f3a [target_core_mod]
 #11 [ffffa2bfc9ec3d00] idr_for_each at ffffffffa803f6fc
 #12 [ffffa2bfc9ec3d60] target_for_each_device at ffffffffc0935670 [target_core_mod]
 #13 [ffffa2bfc9ec3d98] transport_deregister_session at ffffffffc0946408 [target_core_mod]
 #14 [ffffa2bfc9ec3dc8] iscsit_close_session at ffffffffc09a44a6 [iscsi_target_mod]
 #15 [ffffa2bfc9ec3df0] iscsit_close_connection at ffffffffc09a4a88 [iscsi_target_mod]
 #16 [ffffa2bfc9ec3df8] finish_task_switch at ffffffffa76e5d07
 #17 [ffffa2bfc9ec3e78] iscsit_take_action_for_connection_exit at ffffffffc0991c23 [iscsi_target_mod]
 #18 [ffffa2bfc9ec3ea0] iscsi_target_tx_thread at ffffffffc09a403b [iscsi_target_mod]
 #19 [ffffa2bfc9ec3f08] kthread at ffffffffa76d8080
 #20 [ffffa2bfc9ec3f50] ret_from_fork at ffffffffa8200364
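
The recursive acquisition in the trace above can be reproduced in miniature in user space. The sketch below is an analogy only and is not the kernel code: `device_lock`, `free_device()` and the two iterator functions are hypothetical names. It uses a POSIX error-checking mutex, which reports `EDEADLK` on a same-thread relock instead of hanging the way a kernel mutex does. The second iterator shows one generic way to avoid the nesting (calling the release path only after the lock is dropped), which is not necessarily how the upstream commit resolves it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical lock standing in for "device_mutex". */
static pthread_mutex_t device_lock;

/* Create the lock as an error-checking mutex so a relock is reported. */
static int device_lock_init(void)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);

    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0)
        rc = pthread_mutex_init(&device_lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Release path that takes the lock itself, like target_free_device(). */
static int free_device(void)
{
    int rc = pthread_mutex_lock(&device_lock);

    if (rc != 0)
        return rc; /* EDEADLK when this thread already holds the lock */
    rc = pthread_mutex_unlock(&device_lock);
    assert(rc == 0);
    return 0;
}

/* Iterator that reaches the release path while still holding the lock. */
static int for_each_device_nested(void)
{
    int rc = pthread_mutex_lock(&device_lock);

    assert(rc == 0);
    rc = free_device(); /* second acquisition by the same thread */
    pthread_mutex_unlock(&device_lock);
    return rc;
}

/* Iterator that drops the lock before the release path runs. */
static int for_each_device_deferred(void)
{
    int rc = pthread_mutex_lock(&device_lock);

    assert(rc == 0);
    /* ... walk the devices under the lock ... */
    pthread_mutex_unlock(&device_lock);
    return free_device(); /* lock is no longer held here */
}
```

After `device_lock_init()` succeeds, `for_each_device_nested()` returns `EDEADLK`, while `for_each_device_deferred()` returns 0.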

Fixes: 36d4cb4 ("scsi: target: Avoid that EXTENDED COPY commands trigger lock inversion")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Link: https://lore.kernel.org/r/20230918225848.66463-1-junxiao.bi@oracle.com
Reviewed-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>