Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixups to build against FreeBSD HEAD #2

Merged
merged 9 commits into from
Aug 23, 2017

Conversation

markjdb
Copy link
Contributor

@markjdb markjdb commented Aug 22, 2017

  • Move DRM headers to the right place.
  • Clean up module Makefiles.
  • Import some LinuxKPI and debugfs bits.

I've only tested this so far on amdgpu (with an RX 460).

@nomadlogic nomadlogic requested review from iotamudelta and removed request for iotamudelta August 22, 2017 23:53
@iotamudelta
Copy link
Member

Thanks! Merging and updating port now!

@iotamudelta iotamudelta merged commit 4ee5acb into FreeBSDDesktop:master Aug 23, 2017
johalun pushed a commit that referenced this pull request Feb 28, 2018
The rcu_barrier() takes the cpu_hotplug mutex which itself is not
reclaim-safe, and so rcu_barrier() is illegal from inside the shrinker.

[  309.661373] =========================================================
[  309.661376] [ INFO: possible irq lock inversion dependency detected ]
[  309.661380] 4.11.0-rc1-CI-CI_DRM_2333+ #1 Tainted: G        W
[  309.661383] ---------------------------------------------------------
[  309.661386] gem_exec_gttfil/6435 just changed the state of lock:
[  309.661389]  (rcu_preempt_state.barrier_mutex){+.+.-.}, at: [<ffffffff81100731>] _rcu_barrier+0x31/0x160
[  309.661399] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[  309.661402]  (cpu_hotplug.lock){+.+.+.}
[  309.661404]

               and interrupts could create inverse lock ordering between them.

[  309.661410]
               other info that might help us debug this:
[  309.661414]  Possible interrupt unsafe locking scenario:

[  309.661417]        CPU0                    CPU1
[  309.661419]        ----                    ----
[  309.661421]   lock(cpu_hotplug.lock);
[  309.661425]                                local_irq_disable();
[  309.661432]                                lock(rcu_preempt_state.barrier_mutex);
[  309.661441]                                lock(cpu_hotplug.lock);
[  309.661446]   <Interrupt>
[  309.661448]     lock(rcu_preempt_state.barrier_mutex);
[  309.661453]
                *** DEADLOCK ***

[  309.661460] 4 locks held by gem_exec_gttfil/6435:
[  309.661464]  #0:  (sb_writers#10){.+.+.+}, at: [<ffffffff8120d83d>] vfs_write+0x17d/0x1f0
[  309.661475]  #1:  (debugfs_srcu){......}, at: [<ffffffff81320491>] debugfs_use_file_start+0x41/0xa0
[  309.661486]  #2:  (&attr->mutex){+.+.+.}, at: [<ffffffff8123a3e7>] simple_attr_write+0x37/0xe0
[  309.661495]  #3:  (&dev->struct_mutex){+.+.+.}, at: [<ffffffffa0091b4a>] i915_drop_caches_set+0x3a/0x150 [i915]
[  309.661540]
               the shortest dependencies between 2nd lock and 1st lock:
[  309.661547]  -> (cpu_hotplug.lock){+.+.+.} ops: 829 {
[  309.661553]     HARDIRQ-ON-W at:
[  309.661560]                       __lock_acquire+0x5e5/0x1b50
[  309.661565]                       lock_acquire+0xc9/0x220
[  309.661572]                       __mutex_lock+0x6e/0x990
[  309.661576]                       mutex_lock_nested+0x16/0x20
[  309.661583]                       get_online_cpus+0x61/0x80
[  309.661590]                       kmem_cache_create+0x25/0x1d0
[  309.661596]                       debug_objects_mem_init+0x30/0x249
[  309.661602]                       start_kernel+0x341/0x3fe
[  309.661607]                       x86_64_start_reservations+0x2a/0x2c
[  309.661612]                       x86_64_start_kernel+0x173/0x186
[  309.661619]                       verify_cpu+0x0/0xfc
[  309.661622]     SOFTIRQ-ON-W at:
[  309.661627]                       __lock_acquire+0x611/0x1b50
[  309.661632]                       lock_acquire+0xc9/0x220
[  309.661636]                       __mutex_lock+0x6e/0x990
[  309.661641]                       mutex_lock_nested+0x16/0x20
[  309.661646]                       get_online_cpus+0x61/0x80
[  309.661650]                       kmem_cache_create+0x25/0x1d0
[  309.661655]                       debug_objects_mem_init+0x30/0x249
[  309.661660]                       start_kernel+0x341/0x3fe
[  309.661664]                       x86_64_start_reservations+0x2a/0x2c
[  309.661669]                       x86_64_start_kernel+0x173/0x186
[  309.661674]                       verify_cpu+0x0/0xfc
[  309.661677]     RECLAIM_FS-ON-W at:
[  309.661682]                          mark_held_locks+0x6f/0xa0
[  309.661687]                          lockdep_trace_alloc+0xb3/0x100
[  309.661693]                          kmem_cache_alloc_trace+0x31/0x2e0
[  309.661699]                          __smpboot_create_thread.part.1+0x27/0xe0
[  309.661704]                          smpboot_create_threads+0x61/0x90
[  309.661709]                          cpuhp_invoke_callback+0x9c/0x8a0
[  309.661713]                          cpuhp_up_callbacks+0x31/0xb0
[  309.661718]                          _cpu_up+0x7a/0xc0
[  309.661723]                          do_cpu_up+0x5f/0x80
[  309.661727]                          cpu_up+0xe/0x10
[  309.661734]                          smp_init+0x71/0xb3
[  309.661738]                          kernel_init_freeable+0x94/0x19e
[  309.661743]                          kernel_init+0x9/0xf0
[  309.661748]                          ret_from_fork+0x2e/0x40
[  309.661752]     INITIAL USE at:
[  309.661757]                      __lock_acquire+0x234/0x1b50
[  309.661761]                      lock_acquire+0xc9/0x220
[  309.661766]                      __mutex_lock+0x6e/0x990
[  309.661771]                      mutex_lock_nested+0x16/0x20
[  309.661775]                      get_online_cpus+0x61/0x80
[  309.661780]                      __cpuhp_setup_state+0x44/0x170
[  309.661785]                      page_alloc_init+0x23/0x3a
[  309.661790]                      start_kernel+0x124/0x3fe
[  309.661794]                      x86_64_start_reservations+0x2a/0x2c
[  309.661799]                      x86_64_start_kernel+0x173/0x186
[  309.661804]                      verify_cpu+0x0/0xfc
[  309.661807]   }
[  309.661813]   ... key      at: [<ffffffff81e37690>] cpu_hotplug+0xb0/0x100
[  309.661817]   ... acquired at:
[  309.661821]    lock_acquire+0xc9/0x220
[  309.661825]    __mutex_lock+0x6e/0x990
[  309.661829]    mutex_lock_nested+0x16/0x20
[  309.661833]    get_online_cpus+0x61/0x80
[  309.661837]    _rcu_barrier+0x9f/0x160
[  309.661841]    rcu_barrier+0x10/0x20
[  309.661847]    netdev_run_todo+0x5f/0x310
[  309.661852]    rtnl_unlock+0x9/0x10
[  309.661856]    default_device_exit_batch+0x133/0x150
[  309.661862]    ops_exit_list.isra.0+0x4d/0x60
[  309.661866]    cleanup_net+0x1d8/0x2c0
[  309.661872]    process_one_work+0x1f4/0x6d0
[  309.661876]    worker_thread+0x49/0x4a0
[  309.661881]    kthread+0x107/0x140
[  309.661884]    ret_from_fork+0x2e/0x40

[  309.661890] -> (rcu_preempt_state.barrier_mutex){+.+.-.} ops: 179 {
[  309.661896]    HARDIRQ-ON-W at:
[  309.661901]                     __lock_acquire+0x5e5/0x1b50
[  309.661905]                     lock_acquire+0xc9/0x220
[  309.661910]                     __mutex_lock+0x6e/0x990
[  309.661914]                     mutex_lock_nested+0x16/0x20
[  309.661919]                     _rcu_barrier+0x31/0x160
[  309.661923]                     rcu_barrier+0x10/0x20
[  309.661928]                     netdev_run_todo+0x5f/0x310
[  309.661932]                     rtnl_unlock+0x9/0x10
[  309.661936]                     default_device_exit_batch+0x133/0x150
[  309.661941]                     ops_exit_list.isra.0+0x4d/0x60
[  309.661946]                     cleanup_net+0x1d8/0x2c0
[  309.661951]                     process_one_work+0x1f4/0x6d0
[  309.661955]                     worker_thread+0x49/0x4a0
[  309.661960]                     kthread+0x107/0x140
[  309.661964]                     ret_from_fork+0x2e/0x40
[  309.661968]    SOFTIRQ-ON-W at:
[  309.661972]                     __lock_acquire+0x611/0x1b50
[  309.661977]                     lock_acquire+0xc9/0x220
[  309.661981]                     __mutex_lock+0x6e/0x990
[  309.661986]                     mutex_lock_nested+0x16/0x20
[  309.661990]                     _rcu_barrier+0x31/0x160
[  309.661995]                     rcu_barrier+0x10/0x20
[  309.661999]                     netdev_run_todo+0x5f/0x310
[  309.662003]                     rtnl_unlock+0x9/0x10
[  309.662008]                     default_device_exit_batch+0x133/0x150
[  309.662013]                     ops_exit_list.isra.0+0x4d/0x60
[  309.662017]                     cleanup_net+0x1d8/0x2c0
[  309.662022]                     process_one_work+0x1f4/0x6d0
[  309.662027]                     worker_thread+0x49/0x4a0
[  309.662031]                     kthread+0x107/0x140
[  309.662035]                     ret_from_fork+0x2e/0x40
[  309.662039]    IN-RECLAIM_FS-W at:
[  309.662043]                        __lock_acquire+0x638/0x1b50
[  309.662048]                        lock_acquire+0xc9/0x220
[  309.662053]                        __mutex_lock+0x6e/0x990
[  309.662058]                        mutex_lock_nested+0x16/0x20
[  309.662062]                        _rcu_barrier+0x31/0x160
[  309.662067]                        rcu_barrier+0x10/0x20
[  309.662089]                        i915_gem_shrink_all+0x33/0x40 [i915]
[  309.662109]                        i915_drop_caches_set+0x141/0x150 [i915]
[  309.662114]                        simple_attr_write+0xc7/0xe0
[  309.662119]                        full_proxy_write+0x4f/0x70
[  309.662124]                        __vfs_write+0x23/0x120
[  309.662128]                        vfs_write+0xc6/0x1f0
[  309.662133]                        SyS_write+0x44/0xb0
[  309.662138]                        entry_SYSCALL_64_fastpath+0x1c/0xb1
[  309.662142]    INITIAL USE at:
[  309.662147]                    __lock_acquire+0x234/0x1b50
[  309.662151]                    lock_acquire+0xc9/0x220
[  309.662156]                    __mutex_lock+0x6e/0x990
[  309.662160]                    mutex_lock_nested+0x16/0x20
[  309.662165]                    _rcu_barrier+0x31/0x160
[  309.662169]                    rcu_barrier+0x10/0x20
[  309.662174]                    netdev_run_todo+0x5f/0x310
[  309.662178]                    rtnl_unlock+0x9/0x10
[  309.662183]                    default_device_exit_batch+0x133/0x150
[  309.662188]                    ops_exit_list.isra.0+0x4d/0x60
[  309.662192]                    cleanup_net+0x1d8/0x2c0
[  309.662197]                    process_one_work+0x1f4/0x6d0
[  309.662202]                    worker_thread+0x49/0x4a0
[  309.662206]                    kthread+0x107/0x140
[  309.662210]                    ret_from_fork+0x2e/0x40
[  309.662214]  }
[  309.662220]  ... key      at: [<ffffffff81e4e1c8>] rcu_preempt_state+0x508/0x780
[  309.662225]  ... acquired at:
[  309.662229]    check_usage_forwards+0x12b/0x130
[  309.662233]    mark_lock+0x360/0x6f0
[  309.662237]    __lock_acquire+0x638/0x1b50
[  309.662241]    lock_acquire+0xc9/0x220
[  309.662245]    __mutex_lock+0x6e/0x990
[  309.662249]    mutex_lock_nested+0x16/0x20
[  309.662253]    _rcu_barrier+0x31/0x160
[  309.662257]    rcu_barrier+0x10/0x20
[  309.662279]    i915_gem_shrink_all+0x33/0x40 [i915]
[  309.662298]    i915_drop_caches_set+0x141/0x150 [i915]
[  309.662303]    simple_attr_write+0xc7/0xe0
[  309.662307]    full_proxy_write+0x4f/0x70
[  309.662311]    __vfs_write+0x23/0x120
[  309.662315]    vfs_write+0xc6/0x1f0
[  309.662319]    SyS_write+0x44/0xb0
[  309.662323]    entry_SYSCALL_64_fastpath+0x1c/0xb1

[  309.662329]
               stack backtrace:
[  309.662335] CPU: 1 PID: 6435 Comm: gem_exec_gttfil Tainted: G        W       4.11.0-rc1-CI-CI_DRM_2333+ #1
[  309.662342] Hardware name: Hewlett-Packard HP Compaq 8100 Elite SFF PC/304Ah, BIOS 786H1 v01.13 07/14/2011
[  309.662348] Call Trace:
[  309.662354]  dump_stack+0x67/0x92
[  309.662359]  print_irq_inversion_bug.part.19+0x1a4/0x1b0
[  309.662365]  check_usage_forwards+0x12b/0x130
[  309.662369]  mark_lock+0x360/0x6f0
[  309.662374]  ? print_shortest_lock_dependencies+0x1a0/0x1a0
[  309.662379]  __lock_acquire+0x638/0x1b50
[  309.662383]  ? __mutex_unlock_slowpath+0x3e/0x2e0
[  309.662388]  ? trace_hardirqs_on+0xd/0x10
[  309.662392]  ? _rcu_barrier+0x31/0x160
[  309.662396]  lock_acquire+0xc9/0x220
[  309.662400]  ? _rcu_barrier+0x31/0x160
[  309.662404]  ? _rcu_barrier+0x31/0x160
[  309.662409]  __mutex_lock+0x6e/0x990
[  309.662412]  ? _rcu_barrier+0x31/0x160
[  309.662416]  ? _rcu_barrier+0x31/0x160
[  309.662421]  ? synchronize_rcu_expedited+0x35/0xb0
[  309.662426]  ? _raw_spin_unlock_irqrestore+0x52/0x60
[  309.662434]  mutex_lock_nested+0x16/0x20
[  309.662438]  _rcu_barrier+0x31/0x160
[  309.662442]  rcu_barrier+0x10/0x20
[  309.662464]  i915_gem_shrink_all+0x33/0x40 [i915]
[  309.662484]  i915_drop_caches_set+0x141/0x150 [i915]
[  309.662489]  simple_attr_write+0xc7/0xe0
[  309.662494]  full_proxy_write+0x4f/0x70
[  309.662498]  __vfs_write+0x23/0x120
[  309.662503]  ? rcu_read_lock_sched_held+0x75/0x80
[  309.662507]  ? rcu_sync_lockdep_assert+0x2a/0x50
[  309.662512]  ? __sb_start_write+0x102/0x210
[  309.662516]  ? vfs_write+0x17d/0x1f0
[  309.662520]  vfs_write+0xc6/0x1f0
[  309.662524]  ? trace_hardirqs_on_caller+0xe7/0x200
[  309.662529]  SyS_write+0x44/0xb0
[  309.662533]  entry_SYSCALL_64_fastpath+0x1c/0xb1
[  309.662537] RIP: 0033:0x7f507eac24a0
[  309.662541] RSP: 002b:00007fffda8720e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  309.662548] RAX: ffffffffffffffda RBX: ffffffff81482bd3 RCX: 00007f507eac24a0
[  309.662552] RDX: 0000000000000005 RSI: 00007fffda8720f0 RDI: 0000000000000005
[  309.662557] RBP: ffffc9000048bf88 R08: 0000000000000000 R09: 000000000000002c
[  309.662561] R10: 0000000000000014 R11: 0000000000000246 R12: 00007fffda872230
[  309.662566] R13: 00007fffda872228 R14: 0000000000000201 R15: 00007fffda8720f0
[  309.662572]  ? __this_cpu_preempt_check+0x13/0x20

Fixes: 0eafec6d3244 ("drm/i915: Enable lockless lookup of request tracking via RCU")
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100192
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: <stable@vger.kernel.org> # v4.9+
Link: http://patchwork.freedesktop.org/patch/msgid/20170314115019.18127-1-chris@chris-wilson.co.uk
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
johalun pushed a commit that referenced this pull request Mar 1, 2018
this bug happened when amdgpu load failed.

[   75.740951] BUG: unable to handle kernel paging request at 00000000000031c0
[   75.748167] IP: [<ffffffffa064a0e0>] amdgpu_fbdev_restore_mode+0x20/0x60 [amdgpu]
[   75.755774] PGD 0

[   75.759185] Oops: 0000 [#1] SMP
[   75.762408] Modules linked in: amdgpu(OE-) ttm(OE) drm_kms_helper(OE) drm(OE) i2c_algo_bit(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) rpcsec_gss_krb5(E) nfsv4(E) nfs(E) fscache(E) eeepc_wmi(E) asus_wmi(E) sparse_keymap(E) intel_rapl(E) snd_hda_codec_hdmi(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) snd_hwdep(E) snd_pcm(E) snd_seq_midi(E) coretemp(E) kvm_intel(E) snd_seq_midi_event(E) snd_rawmidi(E) kvm(E) snd_seq(E) joydev(E) snd_seq_device(E) snd_timer(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) mei_me(E) ghash_clmulni_intel(E) snd(E) aesni_intel(E) mei(E) soundcore(E) aes_x86_64(E) shpchp(E) serio_raw(E) lrw(E) acpi_pad(E) gf128mul(E) glue_helper(E) ablk_helper(E) mac_hid(E)
[   75.835574]  cryptd(E) parport_pc(E) ppdev(E) lp(E) nfsd(E) parport(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) autofs4(E) hid_generic(E) usbhid(E) mxm_wmi(E) psmouse(E) e1000e(E) ptp(E) pps_core(E) ahci(E) libahci(E) wmi(E) video(E) i2c_hid(E) hid(E)
[   75.858489] CPU: 5 PID: 1603 Comm: rmmod Tainted: G           OE   4.9.0-custom #2
[   75.866183] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 0901 08/31/2015
[   75.875050] task: ffff88045d1bbb80 task.stack: ffffc90002de4000
[   75.881094] RIP: 0010:[<ffffffffa064a0e0>]  [<ffffffffa064a0e0>] amdgpu_fbdev_restore_mode+0x20/0x60 [amdgpu]
[   75.891238] RSP: 0018:ffffc90002de7d48  EFLAGS: 00010286
[   75.896648] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
[   75.903933] RDX: 0000000000000000 RSI: ffff88045d1bbb80 RDI: 0000000000000286
[   75.911183] RBP: ffffc90002de7d50 R08: 0000000000000502 R09: 0000000000000004
[   75.918449] R10: 0000000000000000 R11: 0000000000000001 R12: ffff880464bf0000
[   75.925675] R13: ffffffffa0853000 R14: 0000000000000000 R15: 0000564e44f88210
[   75.932980] FS:  00007f13d5400700(0000) GS:ffff880476540000(0000) knlGS:0000000000000000
[   75.941238] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   75.947088] CR2: 00000000000031c0 CR3: 000000045fd0b000 CR4: 00000000003406e0
[   75.954332] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   75.961566] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   75.968834] Stack:
[   75.970881]  ffff880464bf0000 ffffc90002de7d60 ffffffffa0636592 ffffc90002de7d80
[   75.978454]  ffffffffa059015f ffff880464bf0000 ffff880464bf0000 ffffc90002de7da8
[   75.986076]  ffffffffa0595216 ffff880464bf0000 ffff880460f4d000 ffffffffa0853000
[   75.993692] Call Trace:
[   75.996177]  [<ffffffffa0636592>] amdgpu_driver_lastclose_kms+0x12/0x20 [amdgpu]
[   76.003700]  [<ffffffffa059015f>] drm_lastclose+0x2f/0xd0 [drm]
[   76.009777]  [<ffffffffa0595216>] drm_dev_unregister+0x16/0xd0 [drm]
[   76.016255]  [<ffffffffa0595944>] drm_put_dev+0x34/0x70 [drm]
[   76.022139]  [<ffffffffa062f365>] amdgpu_pci_remove+0x15/0x20 [amdgpu]
[   76.028800]  [<ffffffff81416499>] pci_device_remove+0x39/0xc0
[   76.034661]  [<ffffffff81531caa>] __device_release_driver+0x9a/0x140
[   76.041121]  [<ffffffff81531e58>] driver_detach+0xb8/0xc0
[   76.046575]  [<ffffffff81530c95>] bus_remove_driver+0x55/0xd0
[   76.052401]  [<ffffffff815325fc>] driver_unregister+0x2c/0x50
[   76.058244]  [<ffffffff81416289>] pci_unregister_driver+0x29/0x90
[   76.064466]  [<ffffffffa0596c5e>] drm_pci_exit+0x9e/0xb0 [drm]
[   76.070507]  [<ffffffffa0796d71>] amdgpu_exit+0x1c/0x32 [amdgpu]
[   76.076609]  [<ffffffff81104810>] SyS_delete_module+0x1a0/0x200
[   76.082627]  [<ffffffff810e2b1a>] ? rcu_eqs_enter.isra.36+0x4a/0x50
[   76.089001]  [<ffffffff8100392e>] do_syscall_64+0x6e/0x180
[   76.094583]  [<ffffffff817e1d2f>] entry_SYSCALL64_slow_path+0x25/0x25
[   76.101114] Code: 94 c0 c3 31 c0 5d c3 0f 1f 40 00 0f 1f 44 00 00 55 31 c0 48 89 e5 53 48 89 fb 48 c7 c7 1d 21 84 a0 e8 ab 77 b3 e0 e8 fc 8b d7 e0 <48> 8b bb c0 31 00 00 48 85 ff 74 09 e8 ff eb fc ff 85 c0 75 03
[   76.121432] RIP  [<ffffffffa064a0e0>] amdgpu_fbdev_restore_mode+0x20/0x60 [amdgpu]

Signed-off-by: Rex Zhu <Rex.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
johalun pushed a commit that referenced this pull request Mar 17, 2018
Pass crtc_state to the enable callback, and connector_state to all callbacks.
This will eliminate the need to guess for the correct pipe in these
callbacks.

The crtc state is required for pch_enable_backlight to obtain the correct
cpu_transcoder.

intel_dp_aux_backlight's setup function is called before hw readout, so
crtc_state and connector_state->best_encoder are NULL in the enable()
and set() callbacks.

This fixes the following series of warns from intel_get_pipe_from_connector:
[  219.968428] ------------[ cut here ]------------
[  219.968481] WARNING: CPU: 3 PID: 2457 at
drivers/gpu/drm/i915/intel_display.c:13881
intel_get_pipe_from_connector+0x62/0x90 [i915]
[  219.968483]
WARN_ON(!drm_modeset_is_locked(&dev->mode_config.connection_mutex))
[  219.968485] Modules linked in: nls_iso8859_1 snd_hda_codec_hdmi
snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec
snd_hda_core snd_hwdep snd_pcm intel_rapl x86_pkg_temp_thermal coretemp
kvm_intel snd_seq_midi snd_seq_midi_event kvm snd_rawmidi irqbypass
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc snd_seq
snd_seq_device serio_raw snd_timer aesni_intel aes_x86_64 crypto_simd
glue_helper cryptd lpc_ich snd mei_me shpchp soundcore mei rfkill_gpio
mac_hid intel_pmc_ipc parport_pc ppdev lp parport ip_tables x_tables
autofs4 hid_generic usbhid igb ahci i915 xhci_pci dca xhci_hcd ptp
sdhci_pci sdhci libahci pps_core i2c_hid hid video
[  219.968573] CPU: 3 PID: 2457 Comm: kworker/u8:3 Tainted: G        W
4.10.0-tip-201703010159+ #2
[  219.968575] Hardware name: Intel Corp. Broxton P/NOTEBOOK, BIOS
APLKRVPA.X64.0144.B10.1606270006 06/27/2016
[  219.968627] Workqueue: events_unbound intel_atomic_commit_work [i915]
[  219.968629] Call Trace:
[  219.968640]  dump_stack+0x63/0x87
[  219.968646]  __warn+0xd1/0xf0
[  219.968651]  warn_slowpath_fmt+0x4f/0x60
[  219.968657]  ? drm_printk+0x97/0xa0
[  219.968708]  intel_get_pipe_from_connector+0x62/0x90 [i915]
[  219.968756]  intel_panel_enable_backlight+0x19/0xf0 [i915]
[  219.968804]  intel_edp_backlight_on.part.22+0x33/0x40 [i915]
[  219.968852]  intel_edp_backlight_on+0x18/0x20 [i915]
[  219.968900]  intel_enable_ddi+0x94/0xc0 [i915]
[  219.968950]  intel_encoders_enable.isra.93+0x77/0x90 [i915]
[  219.969000]  haswell_crtc_enable+0x310/0x7f0 [i915]
[  219.969051]  intel_update_crtc+0x58/0x100 [i915]
[  219.969101]  skl_update_crtcs+0x218/0x240 [i915]
[  219.969153]  intel_atomic_commit_tail+0x350/0x1000 [i915]
[  219.969159]  ? vtime_account_idle+0xe/0x50
[  219.969164]  ? finish_task_switch+0x107/0x250
[  219.969214]  intel_atomic_commit_work+0x12/0x20 [i915]
[  219.969219]  process_one_work+0x153/0x3f0
[  219.969223]  worker_thread+0x12b/0x4b0
[  219.969227]  kthread+0x101/0x140
[  219.969230]  ? rescuer_thread+0x340/0x340
[  219.969233]  ? kthread_park+0x90/0x90
[  219.969237]  ? do_syscall_64+0x6e/0x180
[  219.969243]  ret_from_fork+0x2c/0x40
[  219.969246] ---[ end trace 0a8fa19387b9ad6d ]---

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100022
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170612102115.23665-4-maarten.lankhorst@linux.intel.com
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
johalun pushed a commit that referenced this pull request Mar 17, 2018
vfio_unpin_pages will hold a read semaphore however it is already hold
in the same thread by vfio ioctl. It will cause below warning:

[ 5102.127454] ============================================
[ 5102.133379] WARNING: possible recursive locking detected
[ 5102.139304] 4.12.0-rc4+ #3 Not tainted
[ 5102.143483] --------------------------------------------
[ 5102.149407] qemu-system-x86/1620 is trying to acquire lock:
[ 5102.155624]  (&container->group_lock){++++++}, at: [<ffffffff817768c6>] vfio_unpin_pages+0x96/0xf0
[ 5102.165626]
but task is already holding lock:
[ 5102.172134]  (&container->group_lock){++++++}, at: [<ffffffff8177728f>] vfio_fops_unl_ioctl+0x5f/0x280
[ 5102.182522]
other info that might help us debug this:
[ 5102.189806]  Possible unsafe locking scenario:

[ 5102.196411]        CPU0
[ 5102.199136]        ----
[ 5102.201861]   lock(&container->group_lock);
[ 5102.206527]   lock(&container->group_lock);
[ 5102.211191]
*** DEADLOCK ***

[ 5102.217796]  May be due to missing lock nesting notation

[ 5102.225370] 3 locks held by qemu-system-x86/1620:
[ 5102.230618]  #0:  (&container->group_lock){++++++}, at: [<ffffffff8177728f>] vfio_fops_unl_ioctl+0x5f/0x280
[ 5102.241482]  #1:  (&(&iommu->notifier)->rwsem){++++..}, at: [<ffffffff810de775>] __blocking_notifier_call_chain+0x35/0x70
[ 5102.253713]  #2:  (&vgpu->vdev.cache_lock){+.+...}, at: [<ffffffff8157b007>] intel_vgpu_iommu_notifier+0x77/0x120
[ 5102.265163]
stack backtrace:
[ 5102.270022] CPU: 5 PID: 1620 Comm: qemu-system-x86 Not tainted 4.12.0-rc4+ #3
[ 5102.277991] Hardware name: Intel Corporation S1200RP/S1200RP, BIOS S1200RP.86B.03.01.APER.061220151418 06/12/2015
[ 5102.289445] Call Trace:
[ 5102.292175]  dump_stack+0x85/0xc7
[ 5102.295871]  validate_chain.isra.21+0x9da/0xaf0
[ 5102.300925]  __lock_acquire+0x405/0x820
[ 5102.305202]  lock_acquire+0xc7/0x220
[ 5102.309191]  ? vfio_unpin_pages+0x96/0xf0
[ 5102.313666]  down_read+0x2b/0x50
[ 5102.317259]  ? vfio_unpin_pages+0x96/0xf0
[ 5102.321732]  vfio_unpin_pages+0x96/0xf0
[ 5102.326024]  intel_vgpu_iommu_notifier+0xe5/0x120
[ 5102.331283]  notifier_call_chain+0x4a/0x70
[ 5102.335851]  __blocking_notifier_call_chain+0x4d/0x70
[ 5102.341490]  blocking_notifier_call_chain+0x16/0x20
[ 5102.346935]  vfio_iommu_type1_ioctl+0x87b/0x920
[ 5102.351994]  vfio_fops_unl_ioctl+0x81/0x280
[ 5102.356660]  ? __fget+0xf0/0x210
[ 5102.360261]  do_vfs_ioctl+0x93/0x6a0
[ 5102.364247]  ? __fget+0x111/0x210
[ 5102.367942]  SyS_ioctl+0x41/0x70
[ 5102.371542]  entry_SYSCALL_64_fastpath+0x1f/0xbe

put the vfio_unpin_pages in a workqueue can fix this.

v2:
- use for style instead of do{}while(1). (Zhenyu)
v3:
- rename gvt_cache_mark to gvt_cache_mark_remove. (Zhenyu)

Fixes: 659643f7d814 ("drm/i915/gvt/kvmgt: add vfio/mdev support to KVMGT")
Signed-off-by: Chuanxiao Dong <chuanxiao.dong@intel.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
johalun pushed a commit that referenced this pull request Apr 11, 2018
The dead circular lock senario captured is as followed.
The idea of the fix is moving read_user_wptr outside of
acquire_queue...release_queue critical section

[   63.477482] WARNING: possible circular locking dependency detected
[   63.484091] 4.12.0-kfd-ozeng #3 Not tainted
[   63.488531] ------------------------------------------------------
[   63.495146] HelloWorldLoop/2526 is trying to acquire lock:
[   63.501011]  (&mm->mmap_sem){++++++}, at: [<ffffffff911898ce>] __might_fault+0x3e/0x90
[   63.509472]
               but task is already holding lock:
[   63.515716]  (&adev->srbm_mutex){+.+...}, at: [<ffffffffc0484feb>] lock_srbm+0x2b/0x50 [amdgpu]
[   63.525099]
               which lock already depends on the new lock.

[   63.533841]
               the existing dependency chain (in reverse order) is:
[   63.541839]
               -> #2 (&adev->srbm_mutex){+.+...}:
[   63.548178]        lock_acquire+0x6d/0x90
[   63.552461]        __mutex_lock+0x70/0x8c0
[   63.556826]        mutex_lock_nested+0x16/0x20
[   63.561603]        gfx_v8_0_kiq_resume+0x1039/0x14a0 [amdgpu]
[   63.567817]        gfx_v8_0_hw_init+0x204d/0x2210 [amdgpu]
[   63.573675]        amdgpu_device_init+0xdea/0x1790 [amdgpu]
[   63.579640]        amdgpu_driver_load_kms+0x63/0x220 [amdgpu]
[   63.585743]        drm_dev_register+0x145/0x1e0
[   63.590605]        amdgpu_pci_probe+0x11e/0x160 [amdgpu]
[   63.596266]        local_pci_probe+0x40/0xa0
[   63.600803]        pci_device_probe+0x134/0x150
[   63.605650]        driver_probe_device+0x2a1/0x460
[   63.610785]        __driver_attach+0xdc/0xe0
[   63.615321]        bus_for_each_dev+0x5f/0x90
[   63.619984]        driver_attach+0x19/0x20
[   63.624337]        bus_add_driver+0x40/0x270
[   63.628908]        driver_register+0x5b/0xe0
[   63.633446]        __pci_register_driver+0x5b/0x60
[   63.638586]        rtsx_pci_switch_output_voltage+0x1d/0x20 [rtsx_pci]
[   63.645564]        do_one_initcall+0x4c/0x1b0
[   63.650205]        do_init_module+0x56/0x1ea
[   63.654767]        load_module+0x208c/0x27d0
[   63.659335]        SYSC_finit_module+0x96/0xd0
[   63.664058]        SyS_finit_module+0x9/0x10
[   63.668629]        entry_SYSCALL_64_fastpath+0x1f/0xbe
[   63.674088]
               -> #1 (reservation_ww_class_mutex){+.+.+.}:
[   63.681257]        lock_acquire+0x6d/0x90
[   63.685551]        __ww_mutex_lock.constprop.11+0x8c/0xed0
[   63.691426]        ww_mutex_lock+0x67/0x70
[   63.695802]        amdgpu_verify_access+0x6d/0x100 [amdgpu]
[   63.701743]        ttm_bo_mmap+0x8e/0x100 [ttm]
[   63.706615]        amdgpu_bo_mmap+0xd/0x60 [amdgpu]
[   63.711814]        amdgpu_mmap+0x35/0x40 [amdgpu]
[   63.716904]        mmap_region+0x3b5/0x5a0
[   63.721255]        do_mmap+0x400/0x4d0
[   63.725260]        vm_mmap_pgoff+0xb0/0xf0
[   63.729625]        SyS_mmap_pgoff+0x19e/0x260
[   63.734292]        SyS_mmap+0x1d/0x20
[   63.738199]        entry_SYSCALL_64_fastpath+0x1f/0xbe
[   63.743681]
               -> #0 (&mm->mmap_sem){++++++}:
[   63.749641]        __lock_acquire+0x1401/0x1420
[   63.754491]        lock_acquire+0x6d/0x90
[   63.758750]        __might_fault+0x6b/0x90
[   63.763176]        kgd_hqd_load+0x24f/0x270 [amdgpu]
[   63.768432]        load_mqd+0x4b/0x50 [amdkfd]
[   63.773192]        create_queue_nocpsch+0x535/0x620 [amdkfd]
[   63.779237]        pqm_create_queue+0x34d/0x4f0 [amdkfd]
[   63.784835]        kfd_ioctl_create_queue+0x282/0x670 [amdkfd]
[   63.790973]        kfd_ioctl+0x310/0x4d0 [amdkfd]
[   63.795944]        do_vfs_ioctl+0x90/0x6e0
[   63.800268]        SyS_ioctl+0x74/0x80
[   63.804207]        entry_SYSCALL_64_fastpath+0x1f/0xbe
[   63.809607]
               other info that might help us debug this:

[   63.818026] Chain exists of:
                 &mm->mmap_sem --> reservation_ww_class_mutex --> &adev->srbm_mutex

[   63.830382]  Possible unsafe locking scenario:

[   63.836605]        CPU0                    CPU1
[   63.841364]        ----                    ----
[   63.846123]   lock(&adev->srbm_mutex);
[   63.850061]                                lock(reservation_ww_class_mutex);
[   63.857475]                                lock(&adev->srbm_mutex);
[   63.864084]   lock(&mm->mmap_sem);
[   63.867657]
                *** DEADLOCK ***

[   63.873884] 3 locks held by HelloWorldLoop/2526:
[   63.878739]  #0:  (&process->mutex){+.+.+.}, at: [<ffffffffc06e1a9a>] kfd_ioctl_create_queue+0x24a/0x670 [amdkfd]
[   63.889543]  #1:  (&dqm->lock){+.+...}, at: [<ffffffffc06eedeb>] create_queue_nocpsch+0x3b/0x620 [amdkfd]
[   63.899684]  #2:  (&adev->srbm_mutex){+.+...}, at: [<ffffffffc0484feb>] lock_srbm+0x2b/0x50 [amdgpu]
[   63.909500]
               stack backtrace:
[   63.914187] CPU: 3 PID: 2526 Comm: HelloWorldLoop Not tainted 4.12.0-kfd-ozeng #3
[   63.922184] Hardware name: AMD Carrizo/Gardenia, BIOS WGA5819N_Weekly_15_08_1 08/19/2015
[   63.930865] Call Trace:
[   63.933464]  dump_stack+0x85/0xc9
[   63.936999]  print_circular_bug+0x1f9/0x207
[   63.941442]  __lock_acquire+0x1401/0x1420
[   63.945745]  ? lock_srbm+0x2b/0x50 [amdgpu]
[   63.950185]  lock_acquire+0x6d/0x90
[   63.953885]  ? __might_fault+0x3e/0x90
[   63.957899]  __might_fault+0x6b/0x90
[   63.961699]  ? __might_fault+0x3e/0x90
[   63.965755]  kgd_hqd_load+0x24f/0x270 [amdgpu]
[   63.970577]  load_mqd+0x4b/0x50 [amdkfd]
[   63.974745]  create_queue_nocpsch+0x535/0x620 [amdkfd]
[   63.980242]  pqm_create_queue+0x34d/0x4f0 [amdkfd]
[   63.985320]  kfd_ioctl_create_queue+0x282/0x670 [amdkfd]
[   63.991021]  kfd_ioctl+0x310/0x4d0 [amdkfd]
[   63.995499]  ? kfd_ioctl_destroy_queue+0x70/0x70 [amdkfd]
[   64.001234]  do_vfs_ioctl+0x90/0x6e0
[   64.005065]  ? up_read+0x1a/0x40
[   64.008496]  SyS_ioctl+0x74/0x80
[   64.011955]  entry_SYSCALL_64_fastpath+0x1f/0xbe
[   64.016863] RIP: 0033:0x7f4b3bd35f07
[   64.020696] RSP: 002b:00007ffe7689ec38 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   64.028786] RAX: ffffffffffffffda RBX: 00000000002a2000 RCX: 00007f4b3bd35f07
[   64.036414] RDX: 00007ffe7689ecb0 RSI: 00000000c0584b02 RDI: 0000000000000005
[   64.044045] RBP: 00007f4a3212d000 R08: 00007f4b3c919000 R09: 0000000000080000
[   64.051674] R10: 00007f4b376b64b8 R11: 0000000000000246 R12: 00007f4a3212d000
[   64.059324] R13: 0000000000000015 R14: 0000000000000064 R15: 00007ffe7689ef50

Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
johalun pushed a commit that referenced this pull request Apr 11, 2018
…ug deadlock

4.14-rc1 gained the fancy new cross-release support in lockdep, which
seems to have uncovered a few more rules about what is allowed and
isn't.

This one here seems to indicate that allocating a work-queue while
holding mmap_sem is a no-go, so let's try to preallocate it.

Of course another way to break this chain would be somewhere in the
cpu hotplug code, since this isn't the only trace we're finding now
which goes through msr_create_device.

Full lockdep splat:

======================================================
WARNING: possible circular locking dependency detected
4.14.0-rc1-CI-CI_DRM_3118+ #1 Tainted: G     U
------------------------------------------------------
prime_mmap/1551 is trying to acquire lock:
 (cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff8109dbb7>] apply_workqueue_attrs+0x17/0x50

but task is already holding lock:
 (&dev_priv->mm_lock){+.+.}, at: [<ffffffffa01a7b2a>] i915_gem_userptr_init__mmu_notifier+0x14a/0x270 [i915]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #6 (&dev_priv->mm_lock){+.+.}:
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       __mutex_lock+0x86/0x9b0
       mutex_lock_nested+0x1b/0x20
       i915_gem_userptr_init__mmu_notifier+0x14a/0x270 [i915]
       i915_gem_userptr_ioctl+0x222/0x2c0 [i915]
       drm_ioctl_kernel+0x69/0xb0
       drm_ioctl+0x2f9/0x3d0
       do_vfs_ioctl+0x94/0x670
       SyS_ioctl+0x41/0x70
       entry_SYSCALL_64_fastpath+0x1c/0xb1

-> #5 (&mm->mmap_sem){++++}:
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       __might_fault+0x68/0x90
       _copy_to_user+0x23/0x70
       filldir+0xa5/0x120
       dcache_readdir+0xf9/0x170
       iterate_dir+0x69/0x1a0
       SyS_getdents+0xa5/0x140
       entry_SYSCALL_64_fastpath+0x1c/0xb1

-> #4 (&sb->s_type->i_mutex_key#5){++++}:
       down_write+0x3b/0x70
       handle_create+0xcb/0x1e0
       devtmpfsd+0x139/0x180
       kthread+0x152/0x190
       ret_from_fork+0x27/0x40

-> #3 ((complete)&req.done){+.+.}:
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       wait_for_common+0x58/0x210
       wait_for_completion+0x1d/0x20
       devtmpfs_create_node+0x13d/0x160
       device_add+0x5eb/0x620
       device_create_groups_vargs+0xe0/0xf0
       device_create+0x3a/0x40
       msr_device_create+0x2b/0x40
       cpuhp_invoke_callback+0xa3/0x840
       cpuhp_thread_fun+0x7a/0x150
       smpboot_thread_fn+0x18a/0x280
       kthread+0x152/0x190
       ret_from_fork+0x27/0x40

-> #2 (cpuhp_state){+.+.}:
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       cpuhp_issue_call+0x10b/0x170
       __cpuhp_setup_state_cpuslocked+0x134/0x2a0
       __cpuhp_setup_state+0x46/0x60
       page_writeback_init+0x43/0x67
       pagecache_init+0x3d/0x42
       start_kernel+0x3a8/0x3fc
       x86_64_start_reservations+0x2a/0x2c
       x86_64_start_kernel+0x6d/0x70
       verify_cpu+0x0/0xfb

-> #1 (cpuhp_state_mutex){+.+.}:
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       __mutex_lock+0x86/0x9b0
       mutex_lock_nested+0x1b/0x20
       __cpuhp_setup_state_cpuslocked+0x52/0x2a0
       __cpuhp_setup_state+0x46/0x60
       page_alloc_init+0x28/0x30
       start_kernel+0x145/0x3fc
       x86_64_start_reservations+0x2a/0x2c
       x86_64_start_kernel+0x6d/0x70
       verify_cpu+0x0/0xfb

-> #0 (cpu_hotplug_lock.rw_sem){++++}:
       check_prev_add+0x430/0x840
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       cpus_read_lock+0x3d/0xb0
       apply_workqueue_attrs+0x17/0x50
       __alloc_workqueue_key+0x1d8/0x4d9
       i915_gem_userptr_init__mmu_notifier+0x1fb/0x270 [i915]
       i915_gem_userptr_ioctl+0x222/0x2c0 [i915]
       drm_ioctl_kernel+0x69/0xb0
       drm_ioctl+0x2f9/0x3d0
       do_vfs_ioctl+0x94/0x670
       SyS_ioctl+0x41/0x70
       entry_SYSCALL_64_fastpath+0x1c/0xb1

other info that might help us debug this:

Chain exists of:
  cpu_hotplug_lock.rw_sem --> &mm->mmap_sem --> &dev_priv->mm_lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&dev_priv->mm_lock);
                               lock(&mm->mmap_sem);
                               lock(&dev_priv->mm_lock);
  lock(cpu_hotplug_lock.rw_sem);

 *** DEADLOCK ***

2 locks held by prime_mmap/1551:
 #0:  (&mm->mmap_sem){++++}, at: [<ffffffffa01a7b18>] i915_gem_userptr_init__mmu_notifier+0x138/0x270 [i915]
 #1:  (&dev_priv->mm_lock){+.+.}, at: [<ffffffffa01a7b2a>] i915_gem_userptr_init__mmu_notifier+0x14a/0x270 [i915]

stack backtrace:
CPU: 4 PID: 1551 Comm: prime_mmap Tainted: G     U          4.14.0-rc1-CI-CI_DRM_3118+ #1
Hardware name: Dell Inc. XPS 8300  /0Y2MRG, BIOS A06 10/17/2011
Call Trace:
 dump_stack+0x68/0x9f
 print_circular_bug+0x235/0x3c0
 ? lockdep_init_map_crosslock+0x20/0x20
 check_prev_add+0x430/0x840
 __lock_acquire+0x1420/0x15e0
 ? __lock_acquire+0x1420/0x15e0
 ? lockdep_init_map_crosslock+0x20/0x20
 lock_acquire+0xb0/0x200
 ? apply_workqueue_attrs+0x17/0x50
 cpus_read_lock+0x3d/0xb0
 ? apply_workqueue_attrs+0x17/0x50
 apply_workqueue_attrs+0x17/0x50
 __alloc_workqueue_key+0x1d8/0x4d9
 ? __lockdep_init_map+0x57/0x1c0
 i915_gem_userptr_init__mmu_notifier+0x1fb/0x270 [i915]
 i915_gem_userptr_ioctl+0x222/0x2c0 [i915]
 ? i915_gem_userptr_release+0x140/0x140 [i915]
 drm_ioctl_kernel+0x69/0xb0
 drm_ioctl+0x2f9/0x3d0
 ? i915_gem_userptr_release+0x140/0x140 [i915]
 ? __do_page_fault+0x2a4/0x570
 do_vfs_ioctl+0x94/0x670
 ? entry_SYSCALL_64_fastpath+0x5/0xb1
 ? __this_cpu_preempt_check+0x13/0x20
 ? trace_hardirqs_on_caller+0xe3/0x1b0
 SyS_ioctl+0x41/0x70
 entry_SYSCALL_64_fastpath+0x1c/0xb1
RIP: 0033:0x7fbb83c39587
RSP: 002b:00007fff188dc228 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: ffffffff81492963 RCX: 00007fbb83c39587
RDX: 00007fff188dc260 RSI: 00000000c0186473 RDI: 0000000000000003
RBP: ffffc90001487f88 R08: 0000000000000000 R09: 00007fff188dc2ac
R10: 00007fbb83efcb58 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000003 R14: 00000000c0186473 R15: 00007fff188dc2ac
 ? __this_cpu_preempt_check+0x13/0x20

Note that this also has the minor benefit of slightly reducing the
critical section where we hold mmap_sem.

v2: Set ret correctly when we raced with another thread.

v3: Use Chris' diff. Attach the right lockdep splat.

v4: Repaint in Tvrtko's colors (aka don't report ENOMEM if we race and
some other thread managed to not also get an ENOMEM and successfully
install the mmu notifier. Note that the kernel guarantees that small
allocations succeed, so this never actually happens).

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sasha Levin <alexander.levin@verizon.com>
Cc: Marta Lofstedt <marta.lofstedt@intel.com>
Cc: Tejun Heo <tj@kernel.org>
References: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3180/shard-hsw3/igt@prime_mmap@test_userptr.html
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102939
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20171009164401.16035-1-daniel.vetter@ffwll.ch
johalun pushed a commit that referenced this pull request Apr 11, 2018
stop_machine is not really a locking primitive we should use, except
when the hw folks tell us the hw is broken and that's the only way to
work around it.

This patch tries to address the locking abuse of stop_machine() from

commit 20e4933c478a1ca694b38fa4ac44d99e659941f5
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Nov 22 14:41:21 2016 +0000

    drm/i915: Stop the machine as we install the wedged submit_request handler

Chris said parts of the reasons for going with stop_machine() was that
it's no overhead for the fast-path. But these callbacks use irqsave
spinlocks and do a bunch of MMIO, and rcu_read_lock is _real_ fast.

To stay as close as possible to the stop_machine semantics we first
update all the submit function pointers to the nop handler, then call
synchronize_rcu() to make sure no new requests can be submitted. This
should give us exactly the huge barrier we want.

I pondered whether we should annotate engine->submit_request as __rcu
and use rcu_assign_pointer and rcu_dereference on it. But the reason
behind those is to make sure the compiler/cpu barriers are there for
when you have an actual data structure you point at, to make sure all
the writes are seen correctly on the read side. But we just have a
function pointer, and .text isn't changed, so no need for these
barriers and hence no need for annotations.

Unfortunately there's a complication with the call to
intel_engine_init_global_seqno:

- Without stop_machine we must hold the corresponding spinlock.

- Without stop_machine we must ensure that all requests are marked as
  having failed with dma_fence_set_error() before we call it. That
  means we need to split the nop request submission into two phases,
  both synchronized with rcu:

  1. Only stop submitting the requests to hw and mark them as failed.

  2. After all pending requests in the scheduler/ring are suitably
  marked up as failed and we can force complete them all, also force
  complete by calling intel_engine_init_global_seqno().

This should fix the followwing lockdep splat:

======================================================
WARNING: possible circular locking dependency detected
4.14.0-rc3-CI-CI_DRM_3179+ #1 Tainted: G     U
------------------------------------------------------
kworker/3:4/562 is trying to acquire lock:
 (cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff8113d4bc>] stop_machine+0x1c/0x40

but task is already holding lock:
 (&dev->struct_mutex){+.+.}, at: [<ffffffffa0136588>] i915_reset_device+0x1e8/0x260 [i915]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #6 (&dev->struct_mutex){+.+.}:
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       __mutex_lock+0x86/0x9b0
       mutex_lock_interruptible_nested+0x1b/0x20
       i915_mutex_lock_interruptible+0x51/0x130 [i915]
       i915_gem_fault+0x209/0x650 [i915]
       __do_fault+0x1e/0x80
       __handle_mm_fault+0xa08/0xed0
       handle_mm_fault+0x156/0x300
       __do_page_fault+0x2c5/0x570
       do_page_fault+0x28/0x250
       page_fault+0x22/0x30

-> #5 (&mm->mmap_sem){++++}:
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       __might_fault+0x68/0x90
       _copy_to_user+0x23/0x70
       filldir+0xa5/0x120
       dcache_readdir+0xf9/0x170
       iterate_dir+0x69/0x1a0
       SyS_getdents+0xa5/0x140
       entry_SYSCALL_64_fastpath+0x1c/0xb1

-> #4 (&sb->s_type->i_mutex_key#5){++++}:
       down_write+0x3b/0x70
       handle_create+0xcb/0x1e0
       devtmpfsd+0x139/0x180
       kthread+0x152/0x190
       ret_from_fork+0x27/0x40

-> #3 ((complete)&req.done){+.+.}:
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       wait_for_common+0x58/0x210
       wait_for_completion+0x1d/0x20
       devtmpfs_create_node+0x13d/0x160
       device_add+0x5eb/0x620
       device_create_groups_vargs+0xe0/0xf0
       device_create+0x3a/0x40
       msr_device_create+0x2b/0x40
       cpuhp_invoke_callback+0xc9/0xbf0
       cpuhp_thread_fun+0x17b/0x240
       smpboot_thread_fn+0x18a/0x280
       kthread+0x152/0x190
       ret_from_fork+0x27/0x40

-> #2 (cpuhp_state-up){+.+.}:
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       cpuhp_issue_call+0x133/0x1c0
       __cpuhp_setup_state_cpuslocked+0x139/0x2a0
       __cpuhp_setup_state+0x46/0x60
       page_writeback_init+0x43/0x67
       pagecache_init+0x3d/0x42
       start_kernel+0x3a8/0x3fc
       x86_64_start_reservations+0x2a/0x2c
       x86_64_start_kernel+0x6d/0x70
       verify_cpu+0x0/0xfb

-> #1 (cpuhp_state_mutex){+.+.}:
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       __mutex_lock+0x86/0x9b0
       mutex_lock_nested+0x1b/0x20
       __cpuhp_setup_state_cpuslocked+0x53/0x2a0
       __cpuhp_setup_state+0x46/0x60
       page_alloc_init+0x28/0x30
       start_kernel+0x145/0x3fc
       x86_64_start_reservations+0x2a/0x2c
       x86_64_start_kernel+0x6d/0x70
       verify_cpu+0x0/0xfb

-> #0 (cpu_hotplug_lock.rw_sem){++++}:
       check_prev_add+0x430/0x840
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       cpus_read_lock+0x3d/0xb0
       stop_machine+0x1c/0x40
       i915_gem_set_wedged+0x1a/0x20 [i915]
       i915_reset+0xb9/0x230 [i915]
       i915_reset_device+0x1f6/0x260 [i915]
       i915_handle_error+0x2d8/0x430 [i915]
       hangcheck_declare_hang+0xd3/0xf0 [i915]
       i915_hangcheck_elapsed+0x262/0x2d0 [i915]
       process_one_work+0x233/0x660
       worker_thread+0x4e/0x3b0
       kthread+0x152/0x190
       ret_from_fork+0x27/0x40

other info that might help us debug this:

Chain exists of:
  cpu_hotplug_lock.rw_sem --> &mm->mmap_sem --> &dev->struct_mutex

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&dev->struct_mutex);
                               lock(&mm->mmap_sem);
                               lock(&dev->struct_mutex);
  lock(cpu_hotplug_lock.rw_sem);

 *** DEADLOCK ***

3 locks held by kworker/3:4/562:
 #0:  ("events_long"){+.+.}, at: [<ffffffff8109c64a>] process_one_work+0x1aa/0x660
 #1:  ((&(&i915->gpu_error.hangcheck_work)->work)){+.+.}, at: [<ffffffff8109c64a>] process_one_work+0x1aa/0x660
 #2:  (&dev->struct_mutex){+.+.}, at: [<ffffffffa0136588>] i915_reset_device+0x1e8/0x260 [i915]

stack backtrace:
CPU: 3 PID: 562 Comm: kworker/3:4 Tainted: G     U          4.14.0-rc3-CI-CI_DRM_3179+ #1
Hardware name:                  /NUC7i5BNB, BIOS BNKBL357.86A.0048.2017.0704.1415 07/04/2017
Workqueue: events_long i915_hangcheck_elapsed [i915]
Call Trace:
 dump_stack+0x68/0x9f
 print_circular_bug+0x235/0x3c0
 ? lockdep_init_map_crosslock+0x20/0x20
 check_prev_add+0x430/0x840
 ? irq_work_queue+0x86/0xe0
 ? wake_up_klogd+0x53/0x70
 __lock_acquire+0x1420/0x15e0
 ? __lock_acquire+0x1420/0x15e0
 ? lockdep_init_map_crosslock+0x20/0x20
 lock_acquire+0xb0/0x200
 ? stop_machine+0x1c/0x40
 ? i915_gem_object_truncate+0x50/0x50 [i915]
 cpus_read_lock+0x3d/0xb0
 ? stop_machine+0x1c/0x40
 stop_machine+0x1c/0x40
 i915_gem_set_wedged+0x1a/0x20 [i915]
 i915_reset+0xb9/0x230 [i915]
 i915_reset_device+0x1f6/0x260 [i915]
 ? gen8_gt_irq_ack+0x170/0x170 [i915]
 ? work_on_cpu_safe+0x60/0x60
 i915_handle_error+0x2d8/0x430 [i915]
 ? vsnprintf+0xd1/0x4b0
 ? scnprintf+0x3a/0x70
 hangcheck_declare_hang+0xd3/0xf0 [i915]
 ? intel_runtime_pm_put+0x56/0xa0 [i915]
 i915_hangcheck_elapsed+0x262/0x2d0 [i915]
 process_one_work+0x233/0x660
 worker_thread+0x4e/0x3b0
 kthread+0x152/0x190
 ? process_one_work+0x660/0x660
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x27/0x40
Setting dangerous option reset - tainting kernel
i915 0000:00:02.0: Resetting chip after gpu hang
Setting dangerous option reset - tainting kernel
i915 0000:00:02.0: Resetting chip after gpu hang

v2: Have 1 global synchronize_rcu() barrier across all engines, and
improve commit message.

v3: We need to protect the seqno update with the timeline spinlock (in
set_wedged) to avoid racing with other updates of the seqno, like we
already do in nop_submit_request (Chris).

v4: Use two-phase sequence to plug the race Chris spotted where we can
complete requests before they're marked up with -EIO.

v5: Review from Chris:
- simplify nop_submit_request.
- Add comment to rcu_read_lock section.
- Align comments with the new style.

v6: Remove unused variable to appease CI.

Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102886
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103096
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Marta Lofstedt <marta.lofstedt@intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20171011091019.1425-1-daniel.vetter@ffwll.ch
johalun pushed a commit that referenced this pull request Apr 11, 2018
Purely to silence lockdep, as we know that no bo can exist at this time
and so the inversion is impossible. Nevertheless, lockdep currently
warns on unload:

[  137.522565] WARNING: possible circular locking dependency detected
[  137.522568] 4.14.0-rc4-CI-CI_DRM_3209+ #1 Tainted: G     U
[  137.522570] ------------------------------------------------------
[  137.522572] drv_module_relo/1532 is trying to acquire lock:
[  137.522574]  ("i915-userptr-acquire"){+.+.}, at: [<ffffffff8109a831>] flush_workqueue+0x91/0x540
[  137.522581]
               but task is already holding lock:
[  137.522583]  (&dev->struct_mutex){+.+.}, at: [<ffffffffa014fb3f>] i915_gem_fini+0x3f/0xc0 [i915]
[  137.522605]
               which lock already depends on the new lock.

[  137.522608]
               the existing dependency chain (in reverse order) is:
[  137.522611]
               -> #3 (&dev->struct_mutex){+.+.}:
[  137.522615]        __lock_acquire+0x1420/0x15e0
[  137.522618]        lock_acquire+0xb0/0x200
[  137.522621]        __mutex_lock+0x86/0x9b0
[  137.522623]        mutex_lock_interruptible_nested+0x1b/0x20
[  137.522640]        i915_mutex_lock_interruptible+0x51/0x130 [i915]
[  137.522657]        i915_gem_fault+0x20b/0x720 [i915]
[  137.522660]        __do_fault+0x1e/0x80
[  137.522662]        __handle_mm_fault+0xa08/0xed0
[  137.522664]        handle_mm_fault+0x156/0x300
[  137.522666]        __do_page_fault+0x2c5/0x570
[  137.522668]        do_page_fault+0x28/0x250
[  137.522671]        page_fault+0x22/0x30
[  137.522672]
               -> #2 (&mm->mmap_sem){++++}:
[  137.522677]        __lock_acquire+0x1420/0x15e0
[  137.522679]        lock_acquire+0xb0/0x200
[  137.522682]        down_read+0x3e/0x70
[  137.522699]        __i915_gem_userptr_get_pages_worker+0x141/0x240 [i915]
[  137.522701]        process_one_work+0x233/0x660
[  137.522704]        worker_thread+0x4e/0x3b0
[  137.522706]        kthread+0x152/0x190
[  137.522708]        ret_from_fork+0x27/0x40
[  137.522710]
               -> #1 ((&work->work)){+.+.}:
[  137.522714]        __lock_acquire+0x1420/0x15e0
[  137.522717]        lock_acquire+0xb0/0x200
[  137.522719]        process_one_work+0x206/0x660
[  137.522721]        worker_thread+0x4e/0x3b0
[  137.522723]        kthread+0x152/0x190
[  137.522725]        ret_from_fork+0x27/0x40
[  137.522727]
               -> #0 ("i915-userptr-acquire"){+.+.}:
[  137.522731]        check_prev_add+0x430/0x840
[  137.522733]        __lock_acquire+0x1420/0x15e0
[  137.522735]        lock_acquire+0xb0/0x200
[  137.522738]        flush_workqueue+0xb4/0x540
[  137.522740]        drain_workqueue+0xd4/0x1b0
[  137.522742]        destroy_workqueue+0x1c/0x200
[  137.522758]        i915_gem_cleanup_userptr+0x15/0x20 [i915]
[  137.522770]        i915_gem_fini+0x5f/0xc0 [i915]
[  137.522782]        i915_driver_unload+0x122/0x180 [i915]
[  137.522794]        i915_pci_remove+0x19/0x30 [i915]
[  137.522797]        pci_device_remove+0x39/0xb0
[  137.522800]        device_release_driver_internal+0x15d/0x220
[  137.522803]        driver_detach+0x40/0x80
[  137.522805]        bus_remove_driver+0x58/0xd0
[  137.522807]        driver_unregister+0x2c/0x40
[  137.522809]        pci_unregister_driver+0x36/0xb0
[  137.522828]        i915_exit+0x1a/0x8b [i915]
[  137.522831]        SyS_delete_module+0x18c/0x1e0
[  137.522834]        entry_SYSCALL_64_fastpath+0x1c/0xb1
[  137.522835]
               other info that might help us debug this:

[  137.522838] Chain exists of:
                 "i915-userptr-acquire" --> &mm->mmap_sem --> &dev->struct_mutex

[  137.522844]  Possible unsafe locking scenario:

[  137.522846]        CPU0                    CPU1
[  137.522848]        ----                    ----
[  137.522850]   lock(&dev->struct_mutex);
[  137.522852]                                lock(&mm->mmap_sem);
[  137.522854]                                lock(&dev->struct_mutex);
[  137.522857]   lock("i915-userptr-acquire");
[  137.522859]
                *** DEADLOCK ***

[  137.522862] 3 locks held by drv_module_relo/1532:
[  137.522864]  #0:  (&dev->mutex){....}, at: [<ffffffff8161d47b>] device_release_driver_internal+0x2b/0x220
[  137.522869]  #1:  (&dev->mutex){....}, at: [<ffffffff8161d489>] device_release_driver_internal+0x39/0x220
[  137.522873]  #2:  (&dev->struct_mutex){+.+.}, at: [<ffffffffa014fb3f>] i915_gem_fini+0x3f/0xc0 [i915]
[  137.522888]
               stack backtrace:
[  137.522891] CPU: 0 PID: 1532 Comm: drv_module_relo Tainted: G     U          4.14.0-rc4-CI-CI_DRM_3209+ #1
[  137.522894] Hardware name:                  /NUC7i5BNB, BIOS BNKBL357.86A.0048.2017.0704.1415 07/04/2017
[  137.522897] Call Trace:
[  137.522900]  dump_stack+0x68/0x9f
[  137.522902]  print_circular_bug+0x235/0x3c0
[  137.522905]  ? lockdep_init_map_crosslock+0x20/0x20
[  137.522908]  check_prev_add+0x430/0x840
[  137.522919]  ? i915_gem_fini+0x5f/0xc0 [i915]
[  137.522922]  ? __kernel_text_address+0x12/0x40
[  137.522925]  ? __save_stack_trace+0x66/0xd0
[  137.522928]  __lock_acquire+0x1420/0x15e0
[  137.522930]  ? __lock_acquire+0x1420/0x15e0
[  137.522933]  ? lockdep_init_map_crosslock+0x20/0x20
[  137.522936]  ? __this_cpu_preempt_check+0x13/0x20
[  137.522938]  lock_acquire+0xb0/0x200
[  137.522940]  ? flush_workqueue+0x91/0x540
[  137.522943]  flush_workqueue+0xb4/0x540
[  137.522945]  ? flush_workqueue+0x91/0x540
[  137.522948]  ? __mutex_unlock_slowpath+0x43/0x2c0
[  137.522951]  ? trace_hardirqs_on_caller+0xe3/0x1b0
[  137.522954]  drain_workqueue+0xd4/0x1b0
[  137.522956]  ? drain_workqueue+0xd4/0x1b0
[  137.522958]  destroy_workqueue+0x1c/0x200
[  137.522975]  i915_gem_cleanup_userptr+0x15/0x20 [i915]
[  137.522987]  i915_gem_fini+0x5f/0xc0 [i915]
[  137.523000]  i915_driver_unload+0x122/0x180 [i915]
[  137.523015]  i915_pci_remove+0x19/0x30 [i915]
[  137.523018]  pci_device_remove+0x39/0xb0
[  137.523021]  device_release_driver_internal+0x15d/0x220
[  137.523023]  driver_detach+0x40/0x80
[  137.523026]  bus_remove_driver+0x58/0xd0
[  137.523028]  driver_unregister+0x2c/0x40
[  137.523030]  pci_unregister_driver+0x36/0xb0
[  137.523049]  i915_exit+0x1a/0x8b [i915]
[  137.523052]  SyS_delete_module+0x18c/0x1e0
[  137.523055]  entry_SYSCALL_64_fastpath+0x1c/0xb1
[  137.523057] RIP: 0033:0x7f7bd0609287
[  137.523059] RSP: 002b:00007ffef694bc18 EFLAGS: 00000246 ORIG_RAX: 00000000000000b0
[  137.523062] RAX: ffffffffffffffda RBX: ffffffff81493f33 RCX: 00007f7bd0609287
[  137.523065] RDX: 0000000000000001 RSI: 0000000000000800 RDI: 0000564f999f9fc8
[  137.523067] RBP: ffffc90005c4ff88 R08: 0000000000000000 R09: 0000000000000080
[  137.523069] R10: 00007f7bd20ef8c0 R11: 0000000000000246 R12: 0000000000000000
[  137.523072] R13: 00007ffef694be00 R14: 0000000000000000 R15: 0000000000000000
[  137.523075]  ? __this_cpu_preempt_check+0x13/0x20

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20171011141857.14161-1-chris@chris-wilson.co.uk
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
johalun pushed a commit that referenced this pull request Apr 12, 2018
Time to remove less important info and make messages clear
and consistent.

v2: change some message levels (Chris)
v3: restore DRM_WARN (Chris)

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Sagar Arun Kamble <sagar.a.kamble@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> #2
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20171016144724.17244-10-michal.wajdeczko@intel.com
johalun pushed a commit that referenced this pull request Apr 15, 2018
We don't need struct_mutex to initialise userptr (it just allocates a
workqueue for itself etc), but we do need struct_mutex later on in
i915_gem_init() in order to feed requests onto the HW.

This should break the chain

[  385.697902] ======================================================
[  385.697907] WARNING: possible circular locking dependency detected
[  385.697913] 4.14.0-CI-Patchwork_7234+ #1 Tainted: G     U
[  385.697917] ------------------------------------------------------
[  385.697922] perf_pmu/2631 is trying to acquire lock:
[  385.697927]  (&mm->mmap_sem){++++}, at: [<ffffffff811bfe1e>] __might_fault+0x3e/0x90
[  385.697941]
               but task is already holding lock:
[  385.697946]  (&cpuctx_mutex){+.+.}, at: [<ffffffff8116fe8c>] perf_event_ctx_lock_nested+0xbc/0x1d0
[  385.697957]
               which lock already depends on the new lock.

[  385.697963]
               the existing dependency chain (in reverse order) is:
[  385.697970]
               -> #4 (&cpuctx_mutex){+.+.}:
[  385.697980]        __mutex_lock+0x86/0x9b0
[  385.697985]        perf_event_init_cpu+0x5a/0x90
[  385.697991]        perf_event_init+0x178/0x1a4
[  385.697997]        start_kernel+0x27f/0x3f1
[  385.698003]        verify_cpu+0x0/0xfb
[  385.698006]
               -> #3 (pmus_lock){+.+.}:
[  385.698015]        __mutex_lock+0x86/0x9b0
[  385.698020]        perf_event_init_cpu+0x21/0x90
[  385.698025]        cpuhp_invoke_callback+0xca/0xc00
[  385.698030]        _cpu_up+0xa7/0x170
[  385.698035]        do_cpu_up+0x57/0x70
[  385.698039]        smp_init+0x62/0xa6
[  385.698044]        kernel_init_freeable+0x97/0x193
[  385.698050]        kernel_init+0xa/0x100
[  385.698055]        ret_from_fork+0x27/0x40
[  385.698058]
               -> #2 (cpu_hotplug_lock.rw_sem){++++}:
[  385.698068]        cpus_read_lock+0x39/0xa0
[  385.698073]        apply_workqueue_attrs+0x12/0x50
[  385.698078]        __alloc_workqueue_key+0x1d8/0x4d8
[  385.698134]        i915_gem_init_userptr+0x5f/0x80 [i915]
[  385.698176]        i915_gem_init+0x7c/0x390 [i915]
[  385.698213]        i915_driver_load+0x99e/0x15c0 [i915]
[  385.698250]        i915_pci_probe+0x33/0x90 [i915]
[  385.698256]        pci_device_probe+0xa1/0x130
[  385.698262]        driver_probe_device+0x293/0x440
[  385.698267]        __driver_attach+0xde/0xe0
[  385.698272]        bus_for_each_dev+0x5c/0x90
[  385.698277]        bus_add_driver+0x16d/0x260
[  385.698282]        driver_register+0x57/0xc0
[  385.698287]        do_one_initcall+0x3e/0x160
[  385.698292]        do_init_module+0x5b/0x1fa
[  385.698297]        load_module+0x2374/0x2dc0
[  385.698302]        SyS_finit_module+0xaa/0xe0
[  385.698307]        entry_SYSCALL_64_fastpath+0x1c/0xb1
[  385.698311]
               -> #1 (&dev->struct_mutex){+.+.}:
[  385.698320]        __mutex_lock+0x86/0x9b0
[  385.698361]        i915_mutex_lock_interruptible+0x4c/0x130 [i915]
[  385.698403]        i915_gem_fault+0x206/0x760 [i915]
[  385.698409]        __do_fault+0x1a/0x70
[  385.698413]        __handle_mm_fault+0x7c4/0xdb0
[  385.698417]        handle_mm_fault+0x154/0x300
[  385.698440]        __do_page_fault+0x2d6/0x570
[  385.698445]        page_fault+0x22/0x30
[  385.698449]
               -> #0 (&mm->mmap_sem){++++}:
[  385.698459]        lock_acquire+0xaf/0x200
[  385.698464]        __might_fault+0x68/0x90
[  385.698470]        _copy_to_user+0x1e/0x70
[  385.698475]        perf_read+0x1aa/0x290
[  385.698480]        __vfs_read+0x23/0x120
[  385.698484]        vfs_read+0xa3/0x150
[  385.698488]        SyS_read+0x45/0xb0
[  385.698493]        entry_SYSCALL_64_fastpath+0x1c/0xb1
[  385.698497]
               other info that might help us debug this:

[  385.698505] Chain exists of:
                 &mm->mmap_sem --> pmus_lock --> &cpuctx_mutex

[  385.698517]  Possible unsafe locking scenario:

[  385.698522]        CPU0                    CPU1
[  385.698526]        ----                    ----
[  385.698529]   lock(&cpuctx_mutex);
[  385.698553]                                lock(pmus_lock);
[  385.698558]                                lock(&cpuctx_mutex);
[  385.698564]   lock(&mm->mmap_sem);
[  385.698568]
                *** DEADLOCK ***

[  385.698574] 1 lock held by perf_pmu/2631:
[  385.698578]  #0:  (&cpuctx_mutex){+.+.}, at: [<ffffffff8116fe8c>] perf_event_ctx_lock_nested+0xbc/0x1d0
[  385.698589]
               stack backtrace:
[  385.698595] CPU: 3 PID: 2631 Comm: perf_pmu Tainted: G     U          4.14.0-CI-Patchwork_7234+ #1
[  385.698602] Hardware name:                  /NUC6CAYB, BIOS AYAPLCEL.86A.0040.2017.0619.1722 06/19/2017
[  385.698609] Call Trace:
[  385.698615]  dump_stack+0x5f/0x86
[  385.698621]  print_circular_bug.isra.18+0x1d0/0x2c0
[  385.698627]  __lock_acquire+0x19c3/0x1b60
[  385.698634]  ? generic_exec_single+0x77/0xe0
[  385.698640]  ? lock_acquire+0xaf/0x200
[  385.698644]  lock_acquire+0xaf/0x200
[  385.698650]  ? __might_fault+0x3e/0x90
[  385.698655]  __might_fault+0x68/0x90
[  385.698660]  ? __might_fault+0x3e/0x90
[  385.698665]  _copy_to_user+0x1e/0x70
[  385.698670]  perf_read+0x1aa/0x290
[  385.698675]  __vfs_read+0x23/0x120
[  385.698682]  ? __fget+0x101/0x1f0
[  385.698686]  vfs_read+0xa3/0x150
[  385.698691]  SyS_read+0x45/0xb0
[  385.698696]  entry_SYSCALL_64_fastpath+0x1c/0xb1
[  385.698701] RIP: 0033:0x7ff1c46876ed
[  385.698705] RSP: 002b:00007fff13552f90 EFLAGS: 00000293 ORIG_RAX: 0000000000000000
[  385.698712] RAX: ffffffffffffffda RBX: ffffc90000647ff0 RCX: 00007ff1c46876ed
[  385.698718] RDX: 0000000000000010 RSI: 00007fff13552fa0 RDI: 0000000000000005
[  385.698723] RBP: 000056063d300580 R08: 0000000000000000 R09: 0000000000000060
[  385.698729] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000046
[  385.698734] R13: 00007fff13552c6f R14: 00007ff1c6279d00 R15: 00007ff1c6279a40

Testcase: igt/perf_pmu
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20171122172621.16158-1-chris@chris-wilson.co.uk
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
(cherry picked from commit ee48700dd57d9ce783ec40f035b324d0b75632e4)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
johalun pushed a commit that referenced this pull request Aug 4, 2018
This gets rid of the following lockdep splat:

======================================================
WARNING: possible circular locking dependency detected
4.15.0-rc2-CI-Patchwork_7428+ #1 Not tainted
------------------------------------------------------
debugfs_test/1351 is trying to acquire lock:
 (&dev->struct_mutex){+.+.}, at: [<000000009d90d1a3>] i915_mutex_lock_interruptible+0x47/0x130 [i915]

but task is already holding lock:
 (&mm->mmap_sem){++++}, at: [<000000005df01c1e>] __do_page_fault+0x106/0x560

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #6 (&mm->mmap_sem){++++}:
       __might_fault+0x63/0x90
       _copy_to_user+0x1e/0x70
       filldir+0x8c/0xf0
       dcache_readdir+0xeb/0x160
       iterate_dir+0xe6/0x150
       SyS_getdents+0xa0/0x130
       entry_SYSCALL_64_fastpath+0x1c/0x89

-> #5 (&sb->s_type->i_mutex_key#5){++++}:
       lockref_get+0x9/0x20

-> #4 ((completion)&req.done){+.+.}:
       wait_for_common+0x54/0x210
       devtmpfs_create_node+0x130/0x150
       device_add+0x5ad/0x5e0
       device_create_groups_vargs+0xd4/0xe0
       device_create+0x35/0x40
       msr_device_create+0x22/0x40
       cpuhp_invoke_callback+0xc5/0xbf0
       cpuhp_thread_fun+0x167/0x210
       smpboot_thread_fn+0x17f/0x270
       kthread+0x173/0x1b0
       ret_from_fork+0x24/0x30

-> #3 (cpuhp_state-up){+.+.}:
       cpuhp_issue_call+0x132/0x1c0
       __cpuhp_setup_state_cpuslocked+0x12f/0x2a0
       __cpuhp_setup_state+0x3a/0x50
       page_writeback_init+0x3a/0x5c
       start_kernel+0x393/0x3e2
       secondary_startup_64+0xa5/0xb0

-> #2 (cpuhp_state_mutex){+.+.}:
       __mutex_lock+0x81/0x9b0
       __cpuhp_setup_state_cpuslocked+0x4b/0x2a0
       __cpuhp_setup_state+0x3a/0x50
       page_alloc_init+0x1f/0x26
       start_kernel+0x139/0x3e2
       secondary_startup_64+0xa5/0xb0

-> #1 (cpu_hotplug_lock.rw_sem){++++}:
       cpus_read_lock+0x34/0xa0
       apply_workqueue_attrs+0xd/0x40
       __alloc_workqueue_key+0x2c7/0x4e1
       intel_guc_submission_init+0x10c/0x650 [i915]
       intel_uc_init_hw+0x29e/0x460 [i915]
       i915_gem_init_hw+0xca/0x290 [i915]
       i915_gem_init+0x115/0x3a0 [i915]
       i915_driver_load+0x9a8/0x16c0 [i915]
       i915_pci_probe+0x2e/0x90 [i915]
       pci_device_probe+0x9c/0x120
       driver_probe_device+0x2a3/0x480
       __driver_attach+0xd9/0xe0
       bus_for_each_dev+0x57/0x90
       bus_add_driver+0x168/0x260
       driver_register+0x52/0xc0
       do_one_initcall+0x39/0x150
       do_init_module+0x56/0x1ef
       load_module+0x231c/0x2d70
       SyS_finit_module+0xa5/0xe0
       entry_SYSCALL_64_fastpath+0x1c/0x89

-> #0 (&dev->struct_mutex){+.+.}:
       lock_acquire+0xaf/0x200
       __mutex_lock+0x81/0x9b0
       i915_mutex_lock_interruptible+0x47/0x130 [i915]
       i915_gem_fault+0x201/0x760 [i915]
       __do_fault+0x15/0x70
       __handle_mm_fault+0x85b/0xe40
       handle_mm_fault+0x14f/0x2f0
       __do_page_fault+0x2d1/0x560
       page_fault+0x22/0x30

other info that might help us debug this:

Chain exists of:
  &dev->struct_mutex --> &sb->s_type->i_mutex_key#5 --> &mm->mmap_sem

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&mm->mmap_sem);
                               lock(&sb->s_type->i_mutex_key#5);
                               lock(&mm->mmap_sem);
  lock(&dev->struct_mutex);

 *** DEADLOCK ***

1 lock held by debugfs_test/1351:
 #0:  (&mm->mmap_sem){++++}, at: [<000000005df01c1e>] __do_page_fault+0x106/0x560

stack backtrace:
CPU: 2 PID: 1351 Comm: debugfs_test Not tainted 4.15.0-rc2-CI-Patchwork_7428+ #1
Hardware name:                  /NUC6i5SYB, BIOS SYSKLi35.86A.0057.2017.0119.1758 01/19/2017
Call Trace:
 dump_stack+0x5f/0x86
 print_circular_bug+0x230/0x3b0
 check_prev_add+0x439/0x7b0
 ? lockdep_init_map_crosslock+0x20/0x20
 ? unwind_get_return_address+0x16/0x30
 ? __lock_acquire+0x1385/0x15a0
 __lock_acquire+0x1385/0x15a0
 lock_acquire+0xaf/0x200
 ? i915_mutex_lock_interruptible+0x47/0x130 [i915]
 __mutex_lock+0x81/0x9b0
 ? i915_mutex_lock_interruptible+0x47/0x130 [i915]
 ? i915_mutex_lock_interruptible+0x47/0x130 [i915]
 ? i915_mutex_lock_interruptible+0x47/0x130 [i915]
 i915_mutex_lock_interruptible+0x47/0x130 [i915]
 ? __pm_runtime_resume+0x4f/0x80
 i915_gem_fault+0x201/0x760 [i915]
 __do_fault+0x15/0x70
 __handle_mm_fault+0x85b/0xe40
 handle_mm_fault+0x14f/0x2f0
 __do_page_fault+0x2d1/0x560
 page_fault+0x22/0x30
RIP: 0033:0x7f98d6f49116
RSP: 002b:00007ffd6ffc3278 EFLAGS: 00010283
RAX: 00007f98d39a2bc0 RBX: 0000000000000000 RCX: 0000000000001680
RDX: 0000000000001680 RSI: 00007ffd6ffc3400 RDI: 00007f98d39a2bc0
RBP: 00007ffd6ffc33a0 R08: 0000000000000000 R09: 00000000000005a0
R10: 000055e847c2a830 R11: 0000000000000002 R12: 0000000000000001
R13: 000055e847c1d040 R14: 00007ffd6ffc3400 R15: 00007f98d6752ba0

v2: Init preempt_work unconditionally (Chris)
v3: Mention that we need the enable_guc=1 for lockdep splat (Chris)

Testcase: igt/debugfs_test/read_all_entries # with i915.enable_guc=1
Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20171213221352.7173-2-michal.winiarski@intel.com
johalun pushed a commit that referenced this pull request Aug 23, 2018
…uct_mutex

This patch fixes lockdep issue due to circular locking dependency of
struct_mutex, i_mutex_key, mmap_sem, relay_channels_mutex.
For GuC log relay channel we create debugfs file that requires i_mutex_key
lock and we are doing that under struct_mutex. So we introduced newer
dependency as:
    &dev->struct_mutex --> &sb->s_type->i_mutex_key#3 --> &mm->mmap_sem
However, there is dependency from mmap_sem to struct_mutex. Hence we
separate the relay create/destroy operation from under struct_mutex.
Also added runtime check of relay buffer status.
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>

======================================================
WARNING: possible circular locking dependency detected
4.15.0-rc6-CI-Patchwork_7614+ #1 Not tainted
------------------------------------------------------
debugfs_test/1388 is trying to acquire lock:
 (&dev->struct_mutex){+.+.}, at: [<00000000d5e1d915>] i915_mutex_lock_interruptible+0x47/0x130 [i915]

but task is already holding lock:
 (&mm->mmap_sem){++++}, at: [<0000000029a9c131>] __do_page_fault+0x106/0x560

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 (&mm->mmap_sem){++++}:
       _copy_to_user+0x1e/0x70
       filldir+0x8c/0xf0
       dcache_readdir+0xeb/0x160
       iterate_dir+0xdc/0x140
       SyS_getdents+0xa0/0x130
       entry_SYSCALL_64_fastpath+0x1c/0x89

-> #2 (&sb->s_type->i_mutex_key#3){++++}:
       start_creating+0x59/0x110
       __debugfs_create_file+0x2e/0xe0
       relay_create_buf_file+0x62/0x80
       relay_late_setup_files+0x84/0x250
       guc_log_late_setup+0x4f/0x110 [i915]
       i915_guc_log_register+0x32/0x40 [i915]
       i915_driver_load+0x7b6/0x1720 [i915]
       i915_pci_probe+0x2e/0x90 [i915]
       pci_device_probe+0x9c/0x120
       driver_probe_device+0x2a3/0x480
       __driver_attach+0xd9/0xe0
       bus_for_each_dev+0x57/0x90
       bus_add_driver+0x168/0x260
       driver_register+0x52/0xc0
       do_one_initcall+0x39/0x150
       do_init_module+0x56/0x1ef
       load_module+0x231c/0x2d70
       SyS_finit_module+0xa5/0xe0
       entry_SYSCALL_64_fastpath+0x1c/0x89

-> #1 (relay_channels_mutex){+.+.}:
       relay_open+0x12c/0x2b0
       intel_guc_log_runtime_create+0xab/0x230 [i915]
       intel_guc_init+0x81/0x120 [i915]
       intel_uc_init+0x29/0xa0 [i915]
       i915_gem_init+0x182/0x530 [i915]
       i915_driver_load+0xaa9/0x1720 [i915]
       i915_pci_probe+0x2e/0x90 [i915]
       pci_device_probe+0x9c/0x120
       driver_probe_device+0x2a3/0x480
       __driver_attach+0xd9/0xe0
       bus_for_each_dev+0x57/0x90
       bus_add_driver+0x168/0x260
       driver_register+0x52/0xc0
       do_one_initcall+0x39/0x150
       do_init_module+0x56/0x1ef
       load_module+0x231c/0x2d70
       SyS_finit_module+0xa5/0xe0
       entry_SYSCALL_64_fastpath+0x1c/0x89

-> #0 (&dev->struct_mutex){+.+.}:
       __mutex_lock+0x81/0x9b0
       i915_mutex_lock_interruptible+0x47/0x130 [i915]
       i915_gem_fault+0x201/0x790 [i915]
       __do_fault+0x15/0x70
       __handle_mm_fault+0x677/0xdc0
       handle_mm_fault+0x14f/0x2f0
       __do_page_fault+0x2d1/0x560
       page_fault+0x4c/0x60

other info that might help us debug this:

Chain exists of:
  &dev->struct_mutex --> &sb->s_type->i_mutex_key#3 --> &mm->mmap_sem

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&mm->mmap_sem);
                               lock(&sb->s_type->i_mutex_key#3);
                               lock(&mm->mmap_sem);
  lock(&dev->struct_mutex);

 *** DEADLOCK ***

1 lock held by debugfs_test/1388:
 #0:  (&mm->mmap_sem){++++}, at: [<0000000029a9c131>] __do_page_fault+0x106/0x560

stack backtrace:
CPU: 2 PID: 1388 Comm: debugfs_test Not tainted 4.15.0-rc6-CI-Patchwork_7614+ #1
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J4205-ITX, BIOS P1.10 09/29/2016
Call Trace:
 dump_stack+0x5f/0x86
 print_circular_bug.isra.18+0x1d0/0x2c0
 __lock_acquire+0x14ae/0x1b60
 ? lock_acquire+0xaf/0x200
 lock_acquire+0xaf/0x200
 ? i915_mutex_lock_interruptible+0x47/0x130 [i915]
 __mutex_lock+0x81/0x9b0
 ? i915_mutex_lock_interruptible+0x47/0x130 [i915]
 ? i915_mutex_lock_interruptible+0x47/0x130 [i915]
 ? i915_mutex_lock_interruptible+0x47/0x130 [i915]
 i915_mutex_lock_interruptible+0x47/0x130 [i915]
 ? __pm_runtime_resume+0x4f/0x80
 i915_gem_fault+0x201/0x790 [i915]
 __do_fault+0x15/0x70
 ? _raw_spin_unlock+0x29/0x40
 __handle_mm_fault+0x677/0xdc0
 handle_mm_fault+0x14f/0x2f0
 __do_page_fault+0x2d1/0x560
 ? page_fault+0x36/0x60
 page_fault+0x4c/0x60

v2: Added lock protection to guc->log.runtime.relay_chan (Chris)
    Fixed locking inside guc_flush_logs uncovered by new lockdep.

v3: Locking guc_read_update_log_buffer entirely with relay_lock. (Chris)
    Prepared intel_guc_init_early. Moved relay_lock inside relay_create
    relay_destroy, relay_file_create, guc_read_update_log_buffer. (Michal)
    Removed struct_mutex lock around guc_log_flush and removed usage
    of guc_log_has_relay() from runtime_create path as it needs
    struct_mutex lock.

v4: Handle NULL relay sub buffer pointer earlier in read_update_log_buffer
    (Chris). Fixed comment suffix **/. (Michal)

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104693
Testcase: igt/debugfs_test/read_all_entries # with enable_guc=1 and guc_log_level=1
Signed-off-by: Sagar Arun Kamble <sagar.a.kamble@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Marta Lofstedt <marta.lofstedt@intel.com>
Cc: Michal Winiarski <michal.winiarski@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/1516808821-3638-3-git-send-email-sagar.a.kamble@intel.com
johalun pushed a commit that referenced this pull request Aug 23, 2018
Although we protect the request itself, we don't lock inside
intel_engine_dump() and so the request maybe retired as we peek into it.
One consequence is that the request->ctx may be freed before we
dereference it, leading to a use-after-free. Replace the hw_id we are
peeking from inside request->ctx with the request->fence.context, with
which we can still track from which context the request originated
(although to tie to HW reports requires a little more legwork, but is
good enough to follow the GEM traces).

[52640.729670] general protection fault: 0000 [#2] SMP
[52640.729694] Dumping ftrace buffer:
[52640.729701]    (ftrace buffer empty)
[52640.729705] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_\
temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul snd_hda_intel snd_hda_codec snd_hwdep gha\
sh_clmulni_intel snd_hda_core snd_pcm mei_me mei i915 r8169 mii prime_numbers i2c_hid
[52640.729748] CPU: 2 PID: 4335 Comm: gem_exec_schedu Tainted: G     UD W        4.16.0-rc3+ #7
[52640.729759] Hardware name: Acer Aspire E5-575G/Ironman_SK  , BIOS V1.12 08/02/2016
[52640.729803] RIP: 0010:print_request+0x2b/0xb0 [i915]
[52640.729811] RSP: 0018:ffffc90001453c18 EFLAGS: 00010206
[52640.729820] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8801e0292d40 RCX: 0000000000000006
[52640.729829] RDX: ffffc90001453c60 RSI: ffff8801e0292d40 RDI: 0000000000000003
[52640.729838] RBP: ffffc90001453d80 R08: 0000000000000000 R09: 0000000000000001
[52640.729847] R10: ffffc90001453bd0 R11: ffffc90001453c73 R12: ffffc90001453c60
[52640.729856] R13: ffffc90001453d80 R14: ffff8801d5a683c8 R15: ffff8801e0292d40
[52640.729866] FS:  00007f1ee50548c0(0000) GS:ffff8801e8200000(0000) knlGS:0000000000000000
[52640.729876] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[52640.729884] CR2: 00007f1ee5077000 CR3: 00000001d9411004 CR4: 00000000003606e0
[52640.729893] Call Trace:
[52640.729922]  intel_engine_print_registers+0x623/0x890 [i915]
[52640.729948]  intel_engine_dump+0x4a3/0x590 [i915]
[52640.729957]  ? seq_printf+0x3a/0x50
[52640.729977]  i915_engine_info+0xb8/0xe0 [i915]
[52640.729984]  ? drm_mode_gamma_get_ioctl+0xf0/0xf0
[52640.729990]  seq_read+0xd5/0x410
[52640.729997]  full_proxy_read+0x4b/0x70
[52640.730004]  __vfs_read+0x1e/0x120
[52640.730009]  ? do_sys_open+0x134/0x220
[52640.730015]  ? kmem_cache_free+0x174/0x2b0
[52640.730021]  vfs_read+0xa1/0x150
[52640.730026]  SyS_read+0x40/0xa0
[52640.730032]  do_syscall_64+0x65/0x1a0
[52640.730038]  entry_SYSCALL_64_after_hwframe+0x42/0xb7

Reported-by: Mika Kuoppala <mika.kuoppala@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180228094732.28462-1-chris@chris-wilson.co.uk
johalun pushed a commit that referenced this pull request Aug 24, 2018
The dw_hdmi_setup_rx_sense exported function should not use struct device
to recover the dw-hdmi context using drvdata, but take struct dw_hdmi
directly like other exported functions.

This caused a regression using Meson DRM on S905X since v4.17-rc1 :

Internal error: Oops: 96000007 [#1] PREEMPT SMP
[...]
CPU: 0 PID: 124 Comm: irq/32-dw_hdmi_ Not tainted 4.17.0-rc7 #2
Hardware name: Libre Technology CC (DT)
[...]
pc : osq_lock+0x54/0x188
lr : __mutex_lock.isra.0+0x74/0x530
[...]
Process irq/32-dw_hdmi_ (pid: 124, stack limit = 0x00000000adf418cb)
Call trace:
  osq_lock+0x54/0x188
  __mutex_lock_slowpath+0x10/0x18
  mutex_lock+0x30/0x38
  __dw_hdmi_setup_rx_sense+0x28/0x98
  dw_hdmi_setup_rx_sense+0x10/0x18
  dw_hdmi_top_thread_irq+0x2c/0x50
  irq_thread_fn+0x28/0x68
  irq_thread+0x10c/0x1a0
  kthread+0x128/0x130
  ret_from_fork+0x10/0x18
 Code: 34000964 d00050a2 51000484 9135c042 (f864d844)
 ---[ end trace 945641e1fbbc07da ]---
 note: irq/32-dw_hdmi_[124] exited with preempt_count 1
 genirq: exiting task "irq/32-dw_hdmi_" (124) is an active IRQ thread (irq 32)

Fixes: eea034af90c6 ("drm/bridge/synopsys: dw-hdmi: don't clobber drvdata")
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Tested-by: Koen Kooi <koen@dominion.thruhere.net>
Signed-off-by: Sean Paul <seanpaul@chromium.org>
Link: https://patchwork.freedesktop.org/patch/msgid/1527673438-20643-1-git-send-email-narmstrong@baylibre.com
johalun pushed a commit that referenced this pull request Nov 20, 2018
Instead of returning small data in response status dword,
GuC may append longer data as response message payload.
If caller provides response buffer, we will copy received
data and use number of received data dwords as new success
return value. We will WARN if response from GuC does not
match caller expectation.

v2: fix timeout and checkpatch warnings (Michal)
v3: fix checkpatch again (Michel)
    update wait function name (Michal)
    no need for spinlock_irqsave (MichalWi)
    no magic numbers (MichalWi)
    must check before use (Jani)
    add some more documentation (Michal)
v4: update documentation (Michal)

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Oscar Mateo <oscar.mateo@intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Michel Thierry <michel.thierry@intel.com> #2.5
Cc: Michal Winiarski <michal.winiarski@intel.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20180327121439.70096-1-michal.wajdeczko@intel.com
johalun pushed a commit that referenced this pull request Nov 20, 2018
If we skip the intel_prepare_reset(), we should also skip the
intel_display_reset(). If we we use a flag set by intel_prepare_reset()
then we do not have to second guess based on external user controlled
state whether or not the prepare was called before deciding to finish
it after the reset. igt/gem_eio is one such example that may tweak
i915.reset faster than the code is expecting, leading to

[  190.233528] =====================================
[  190.233534] WARNING: bad unlock balance detected!
[  190.233540] 4.16.0-rc7-g335ef9849310-drmtip_10+ #1 Tainted: G     U
[  190.233547] -------------------------------------
[  190.233553] gem_eio/1348 is trying to release lock (crtc_ww_class_acquire) at:
[  190.233569] [<ffffffff895c7810>] drm_modeset_acquire_fini+0x0/0x60
[  190.233575] but there are no more locks to release!
[  190.233580]
               other info that might help us debug this:
[  190.233588] 3 locks held by gem_eio/1348:
[  190.233592]  #0:  (&f->f_pos_lock){+.+.}, at: [<00000000ab90c784>] __fdget_pos+0x3a/0x50
[  190.233607]  #1:  (sb_writers#11){.+.+}, at: [<00000000e1529265>] vfs_write+0x188/0x1a0
[  190.233622]  #2:  (&attr->mutex){+.+.}, at: [<0000000011f40afe>] simple_attr_write+0x36/0xd0
[  190.233635]
               stack backtrace:
[  190.233644] CPU: 0 PID: 1348 Comm: gem_eio Tainted: G     U           4.16.0-rc7-g335ef9849310-drmtip_10+ #1
[  190.233655] Hardware name: Dell Inc.                 OptiPlex GX280               /0G8310, BIOS A04 02/09/2005
[  190.233664] Call Trace:
[  190.233674]  dump_stack+0x67/0x95
[  190.233682]  ? drm_modeset_backoff+0x1b0/0x1b0
[  190.233690]  print_unlock_imbalance_bug+0xd2/0xe0
[  190.233698]  ? drm_modeset_backoff+0x1b0/0x1b0
[  190.233704]  lock_release+0x23e/0x300
[  190.233712]  drm_modeset_acquire_fini+0x16/0x60
[  190.233835]  intel_finish_reset+0x72/0x160 [i915]
[  190.233894]  i915_reset_device+0x1e9/0x240 [i915]
[  190.233953]  ? __intel_get_crtc_scanline+0x1c0/0x1c0 [i915]
[  190.233962]  ? work_on_cpu_safe+0x50/0x50
[  190.234020]  i915_handle_error+0x1f2/0x470 [i915]
[  190.234031]  ? __might_fault+0x39/0x90
[  190.234037]  ? __might_fault+0x39/0x90
[  190.234099]  i915_wedged_set+0x7f/0xc0 [i915]
[  190.234107]  simple_attr_write+0xb0/0xd0
[  190.234117]  full_proxy_write+0x51/0x80
[  190.234125]  __vfs_write+0x21/0x140
[  190.234133]  ? rcu_read_lock_sched_held+0x6f/0x80
[  190.234140]  ? rcu_sync_lockdep_assert+0x29/0x50
[  190.234147]  ? __sb_start_write+0x152/0x1f0
[  190.234152]  ? __sb_start_write+0x168/0x1f0
[  190.234159]  vfs_write+0xbd/0x1a0
[  190.234166]  SyS_write+0x40/0xa0
[  190.234173]  ? do_syscall_64+0x19/0x1b0
[  190.234180]  do_syscall_64+0x6b/0x1b0
[  190.234188]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
[  190.234196] RIP: 0033:0x7f84c1b392b7
[  190.234201] RSP: 002b:00007f84b6755b00 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[  190.234211] RAX: ffffffffffffffda RBX: 0000000000000046 RCX: 00007f84c1b392b7
[  190.234218] RDX: 0000000000000002 RSI: 000055ec20abc8d6 RDI: 0000000000000046
[  190.234225] RBP: 000055ec20abc8d6 R08: 0000000000000000 R09: 0000000000000000
[  190.234231] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000002
[  190.234238] R13: 0000000000000000 R14: 00007f84b0000b20 R15: 000055ec20ce4eb8

Testcase: igt/gem_eio
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180405123714.3638-1-chris@chris-wilson.co.uk
johalun pushed a commit that referenced this pull request Nov 21, 2018
CI noticed

<4>[   23.430701] ============================================
<4>[   23.430706] WARNING: possible recursive locking detected
<4>[   23.430713] 4.17.0-rc4-CI-CI_DRM_4156+ #1 Not tainted
<4>[   23.430720] --------------------------------------------
<4>[   23.430725] systemd-udevd/169 is trying to acquire lock:
<4>[   23.430732]         (ptrval) (&(&timeline->lock)->rlock){....}, at: move_to_timeline+0x48/0x12c [i915]
<4>[   23.430888]
                  but task is already holding lock:
<4>[   23.430894]         (ptrval) (&(&timeline->lock)->rlock){....}, at: i915_request_submit+0x1a/0x40 [i915]
<4>[   23.430995]
                  other info that might help us debug this:
<4>[   23.431002]  Possible unsafe locking scenario:

<4>[   23.431007]        CPU0
<4>[   23.431010]        ----
<4>[   23.431013]   lock(&(&timeline->lock)->rlock);
<4>[   23.431021]   lock(&(&timeline->lock)->rlock);
<4>[   23.431028]
                   *** DEADLOCK ***

<4>[   23.431036]  May be due to missing lock nesting notation

<4>[   23.431044] 5 locks held by systemd-udevd/169:
<4>[   23.431049]  #0:         (ptrval) (&dev->mutex){....}, at: __driver_attach+0x42/0xe0
<4>[   23.431065]  #1:         (ptrval) (&dev->mutex){....}, at: __driver_attach+0x50/0xe0
<4>[   23.431078]  #2:         (ptrval) (&dev->struct_mutex){+.+.}, at: i915_gem_init+0xca/0x630 [i915]
<4>[   23.431174]  #3:         (ptrval) (rcu_read_lock){....}, at: submit_notify+0x35/0x124 [i915]
<4>[   23.431271]  #4:         (ptrval) (&(&timeline->lock)->rlock){....}, at: i915_request_submit+0x1a/0x40 [i915]
<4>[   23.431369]
                  stack backtrace:
<4>[   23.431377] CPU: 0 PID: 169 Comm: systemd-udevd Not tainted 4.17.0-rc4-CI-CI_DRM_4156+ #1
<4>[   23.431385] Hardware name: Dell Inc.                 OptiPlex GX280               /0G8310, BIOS A04 02/09/2005
<4>[   23.431394] Call Trace:
<4>[   23.431403]  dump_stack+0x67/0x9b
<4>[   23.431411]  __lock_acquire+0xc67/0x1b50
<4>[   23.431421]  ? ring_buffer_lock_reserve+0x154/0x3f0
<4>[   23.431429]  ? lock_acquire+0xa6/0x210
<4>[   23.431435]  lock_acquire+0xa6/0x210
<4>[   23.431530]  ? move_to_timeline+0x48/0x12c [i915]
<4>[   23.431540]  _raw_spin_lock+0x2a/0x40
<4>[   23.431634]  ? move_to_timeline+0x48/0x12c [i915]
<4>[   23.431730]  move_to_timeline+0x48/0x12c [i915]
<4>[   23.431826]  __i915_request_submit+0xfa/0x280 [i915]
<4>[   23.431923]  i915_request_submit+0x25/0x40 [i915]
<4>[   23.432024]  i9xx_submit_request+0x11/0x140 [i915]
<4>[   23.432120]  submit_notify+0x8d/0x124 [i915]
<4>[   23.432202]  __i915_sw_fence_complete+0x81/0x250 [i915]
<4>[   23.432300]  __i915_request_add+0x31c/0x7c0 [i915]
<4>[   23.432395]  i915_gem_init+0x621/0x630 [i915]
<4>[   23.432476]  i915_driver_load+0xbee/0x10b0 [i915]
<4>[   23.432485]  ? trace_hardirqs_on_caller+0xe0/0x1b0
<4>[   23.432566]  i915_pci_probe+0x29/0x90 [i915]
<4>[   23.432574]  pci_device_probe+0xa1/0x130
<4>[   23.432582]  driver_probe_device+0x306/0x480
<4>[   23.432589]  __driver_attach+0xb7/0xe0
<4>[   23.432596]  ? driver_probe_device+0x480/0x480
<4>[   23.432602]  ? driver_probe_device+0x480/0x480
<4>[   23.432609]  bus_for_each_dev+0x74/0xc0
<4>[   23.432616]  bus_add_driver+0x15f/0x250
<4>[   23.432623]  ? 0xffffffffa02d7000
<4>[   23.432629]  driver_register+0x52/0xc0
<4>[   23.432635]  ? 0xffffffffa02d7000
<4>[   23.432642]  do_one_initcall+0x58/0x370
<4>[   23.432653]  ? do_init_module+0x1d/0x1ea
<4>[   23.432660]  ? rcu_read_lock_sched_held+0x6f/0x80
<4>[   23.432667]  ? kmem_cache_alloc_trace+0x282/0x2e0
<4>[   23.432675]  do_init_module+0x56/0x1ea
<4>[   23.432682]  load_module+0x2435/0x2b20
<4>[   23.432694]  ? __se_sys_finit_module+0xd3/0xf0
<4>[   23.432701]  __se_sys_finit_module+0xd3/0xf0
<4>[   23.432710]  do_syscall_64+0x55/0x190
<4>[   23.432717]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[   23.432724] RIP: 0033:0x7fa780782839
<4>[   23.432729] RSP: 002b:00007ffcea73e668 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
<4>[   23.432738] RAX: ffffffffffffffda RBX: 0000561a472a4b30 RCX: 00007fa780782839
<4>[   23.432745] RDX: 0000000000000000 RSI: 00007fa7804610e5 RDI: 000000000000000e
<4>[   23.432752] RBP: 00007fa7804610e5 R08: 0000000000000000 R09: 00007ffcea73e780
<4>[   23.432758] R10: 000000000000000e R11: 0000000000000246 R12: 0000000000000000
<4>[   23.432765] R13: 0000561a47296450 R14: 0000000000020000 R15: 0000561a472a4b30

but did not report it as an issue as it only occurred during the first
module on boot. This is due to the removal of the distinct global
timeline, and its separate lock class. So instead mark up the expected
nesting. An alternative would be to define a separate lock class for the
engine, but since we only expect to have a single point of nesting, we
can avoid having multiple lock classes for the struct.

Fixes: a89d1f921c15 ("drm/i915: Split i915_gem_timeline into individual timelines")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Tested-by: Michel Thierry <michel.thierry@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180508153514.20251-1-chris@chris-wilson.co.uk
johalun pushed a commit that referenced this pull request Nov 24, 2018
We're fetching GuC/HuC firmwares directly from uc level during
init_early stage but this breaks guc/huc struct isolation and
also strict SW-only initialization rule for init_early. Move fw
fetching to init phase and do it separately per guc/huc struct.

v2: don't forget to move wopcm_init - Michele
v3: fetch in init_misc phase - Michal

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
Reviewed-by: Michel Thierry <michel.thierry@intel.com> #2
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20180628141522.62788-2-michal.wajdeczko@intel.com
johalun pushed a commit that referenced this pull request Nov 26, 2018
With the new CSB processing code, we are not vulnerable to delayed
delivery of a pre-reset interrupt as we use the CSB status pointers in
the HWSP to decide if we need to parse any CSB events and no longer need
to wait for the first post-reset interrupt to be assured that the CSB
mmio registers are valid.

The new icl code to clear registers has a nasty lock inversion:
[   57.409776] ======================================================
[   57.409779] WARNING: possible circular locking dependency detected
[   57.409783] 4.18.0-rc4-CI-CI_DII_1137+ #1 Tainted: G     U  W
[   57.409785] ------------------------------------------------------
[   57.409788] swapper/6/0 is trying to acquire lock:
[   57.409790] 000000004f304ee5 (&engine->timeline.lock/1){-.-.}, at: execlists_submit_request+0x2b/0x1a0 [i915]
[   57.409841]
               but task is already holding lock:
[   57.409844] 00000000aad89594 (&(&rq->lock)->rlock#2){-.-.}, at: notify_ring+0x2b2/0x480 [i915]
[   57.409869]
               which lock already depends on the new lock.

[   57.409872]
               the existing dependency chain (in reverse order) is:
[   57.409876]
               -> #2 (&(&rq->lock)->rlock#2){-.-.}:
[   57.409900]        notify_ring+0x2b2/0x480 [i915]
[   57.409922]        gen8_cs_irq_handler+0x39/0xa0 [i915]
[   57.409943]        gen11_irq_handler+0x2f0/0x420 [i915]
[   57.409949]        __handle_irq_event_percpu+0x42/0x370
[   57.409952]        handle_irq_event_percpu+0x2b/0x70
[   57.409956]        handle_irq_event+0x2f/0x50
[   57.409959]        handle_edge_irq+0xe7/0x190
[   57.409964]        handle_irq+0x67/0x160
[   57.409967]        do_IRQ+0x5e/0x120
[   57.409971]        ret_from_intr+0x0/0x1d
[   57.409974]        _raw_spin_unlock_irqrestore+0x4e/0x60
[   57.409979]        tasklet_action_common.isra.5+0x47/0xb0
[   57.409982]        __do_softirq+0xd9/0x505
[   57.409985]        irq_exit+0xa9/0xc0
[   57.409988]        do_IRQ+0x9a/0x120
[   57.409991]        ret_from_intr+0x0/0x1d
[   57.409995]        cpuidle_enter_state+0xac/0x360
[   57.409999]        do_idle+0x1f3/0x250
[   57.410004]        cpu_startup_entry+0x6a/0x70
[   57.410010]        start_secondary+0x19d/0x1f0
[   57.410015]        secondary_startup_64+0xa5/0xb0
[   57.410018]
               -> #1 (&(&dev_priv->irq_lock)->rlock){-.-.}:
[   57.410081]        clear_gtiir+0x30/0x200 [i915]
[   57.410116]        execlists_reset+0x6e/0x2b0 [i915]
[   57.410140]        i915_reset_engine+0x111/0x190 [i915]
[   57.410165]        i915_handle_error+0x11a/0x4a0 [i915]
[   57.410198]        i915_hangcheck_elapsed+0x378/0x530 [i915]
[   57.410204]        process_one_work+0x248/0x6c0
[   57.410207]        worker_thread+0x37/0x380
[   57.410211]        kthread+0x119/0x130
[   57.410215]        ret_from_fork+0x3a/0x50
[   57.410217]
               -> #0 (&engine->timeline.lock/1){-.-.}:
[   57.410224]        _raw_spin_lock_irqsave+0x33/0x50
[   57.410256]        execlists_submit_request+0x2b/0x1a0 [i915]
[   57.410289]        submit_notify+0x8d/0x124 [i915]
[   57.410314]        __i915_sw_fence_complete+0x81/0x250 [i915]
[   57.410339]        dma_i915_sw_fence_wake+0xd/0x20 [i915]
[   57.410344]        dma_fence_signal_locked+0x79/0x200
[   57.410368]        notify_ring+0x2ba/0x480 [i915]
[   57.410392]        gen8_cs_irq_handler+0x39/0xa0 [i915]
[   57.410416]        gen11_irq_handler+0x2f0/0x420 [i915]
[   57.410421]        __handle_irq_event_percpu+0x42/0x370
[   57.410425]        handle_irq_event_percpu+0x2b/0x70
[   57.410428]        handle_irq_event+0x2f/0x50
[   57.410432]        handle_edge_irq+0xe7/0x190
[   57.410436]        handle_irq+0x67/0x160
[   57.410439]        do_IRQ+0x5e/0x120
[   57.410445]        ret_from_intr+0x0/0x1d
[   57.410449]        cpuidle_enter_state+0xac/0x360
[   57.410453]        do_idle+0x1f3/0x250
[   57.410456]        cpu_startup_entry+0x6a/0x70
[   57.410460]        start_secondary+0x19d/0x1f0
[   57.410464]        secondary_startup_64+0xa5/0xb0
[   57.410466]
               other info that might help us debug this:

[   57.410471] Chain exists of:
                 &engine->timeline.lock/1 --> &(&dev_priv->irq_lock)->rlock --> &(&rq->lock)->rlock#2

[   57.410481]  Possible unsafe locking scenario:

[   57.410485]        CPU0                    CPU1
[   57.410487]        ----                    ----
[   57.410490]   lock(&(&rq->lock)->rlock#2);
[   57.410494]                                lock(&(&dev_priv->irq_lock)->rlock);
[   57.410498]                                lock(&(&rq->lock)->rlock#2);
[   57.410503]   lock(&engine->timeline.lock/1);
[   57.410506]
                *** DEADLOCK ***

[   57.410511] 4 locks held by swapper/6/0:
[   57.410514]  #0: 0000000074575789 (&(&dev_priv->irq_lock)->rlock){-.-.}, at: gen11_irq_handler+0x8a/0x420 [i915]
[   57.410542]  #1: 000000009b29b30e (rcu_read_lock){....}, at: notify_ring+0x1a/0x480 [i915]
[   57.410573]  #2: 00000000aad89594 (&(&rq->lock)->rlock#2){-.-.}, at: notify_ring+0x2b2/0x480 [i915]
[   57.410601]  #3: 000000009b29b30e (rcu_read_lock){....}, at: submit_notify+0x35/0x124 [i915]
[   57.410635]
               stack backtrace:
[   57.410640] CPU: 6 PID: 0 Comm: swapper/6 Tainted: G     U  W         4.18.0-rc4-CI-CI_DII_1137+ #1
[   57.410644] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP, BIOS ICLSFWR1.R00.2222.A01.1805300339 05/30/2018
[   57.410650] Call Trace:
[   57.410652]  <IRQ>
[   57.410657]  dump_stack+0x67/0x9b
[   57.410662]  print_circular_bug.isra.16+0x1c8/0x2b0
[   57.410666]  __lock_acquire+0x1897/0x1b50
[   57.410671]  ? lock_acquire+0xa6/0x210
[   57.410674]  lock_acquire+0xa6/0x210
[   57.410706]  ? execlists_submit_request+0x2b/0x1a0 [i915]
[   57.410711]  _raw_spin_lock_irqsave+0x33/0x50
[   57.410741]  ? execlists_submit_request+0x2b/0x1a0 [i915]
[   57.410769]  execlists_submit_request+0x2b/0x1a0 [i915]
[   57.410774]  ? _raw_spin_unlock_irqrestore+0x39/0x60
[   57.410804]  submit_notify+0x8d/0x124 [i915]
[   57.410828]  __i915_sw_fence_complete+0x81/0x250 [i915]
[   57.410854]  dma_i915_sw_fence_wake+0xd/0x20 [i915]
[   57.410858]  dma_fence_signal_locked+0x79/0x200
[   57.410882]  notify_ring+0x2ba/0x480 [i915]
[   57.410907]  gen8_cs_irq_handler+0x39/0xa0 [i915]
[   57.410933]  gen11_irq_handler+0x2f0/0x420 [i915]
[   57.410938]  __handle_irq_event_percpu+0x42/0x370
[   57.410943]  handle_irq_event_percpu+0x2b/0x70
[   57.410947]  handle_irq_event+0x2f/0x50
[   57.410951]  handle_edge_irq+0xe7/0x190
[   57.410955]  handle_irq+0x67/0x160
[   57.410958]  do_IRQ+0x5e/0x120
[   57.410962]  common_interrupt+0xf/0xf
[   57.410965]  </IRQ>
[   57.410969] RIP: 0010:cpuidle_enter_state+0xac/0x360
[   57.410972] Code: 44 00 00 31 ff e8 84 93 91 ff 45 84 f6 74 12 9c 58 f6 c4 02 0f 85 31 02 00 00 31 ff e8 7d 30 98 ff e8 e8 0e 94 ff fb 4c 29 fb <48> ba cf f7 53 e3 a5 9b c4 20 48 89 d8 48 c1 fb 3f 48 f7 ea b8 ff
[   57.411015] RSP: 0018:ffffc90000133e90 EFLAGS: 00000216 ORIG_RAX: ffffffffffffffdd
[   57.411023] RAX: ffff8804ae748040 RBX: 000000000002a97d RCX: 0000000000000000
[   57.411029] RDX: 0000000000000046 RSI: ffffffff82141263 RDI: ffffffff820f05a7
[   57.411035] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
[   57.411041] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff8229f078
[   57.411045] R13: ffff8804ab2adfa8 R14: 0000000000000000 R15: 0000000d5de092e3
[   57.411052]  do_idle+0x1f3/0x250
[   57.411055]  cpu_startup_entry+0x6a/0x70
[   57.411059]  start_secondary+0x19d/0x1f0
[   57.411064]  secondary_startup_64+0xa5/0xb0

The easiest remedy is to remove the defunct code.

Fixes: ff047a87cfac ("drm/i915/icl: Correctly clear lost ctx-switch interrupts across reset for Gen11")
References: fd8526e50902 ("drm/i915/execlists: Trust the CSB")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Oscar Mateo <oscar.mateo@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Michel Thierry <michel.thierry@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180713203529.1973-3-chris@chris-wilson.co.uk
johalun pushed a commit that referenced this pull request Dec 8, 2018
[why]
Removing connector reusage from DM to match the rest of the tree ended
up revealing an issue that was surprisingly subtle. The original amdgpu
code for DC that was submitted appears to have left a chunk in
dm_dp_create_fake_mst_encoder() that tries to find a "master encoder",
the likes of which isn't actually used or stored anywhere. It does so at
the wrong time as well by trying to access parts of the drm_connector
from the encoder init before it's actually been initialized. This
results in a NULL pointer deref on MST hotplugs:

[  160.696613] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[  160.697234] PGD 0 P4D 0
[  160.697814] Oops: 0010 [#1] SMP PTI
[  160.698430] CPU: 2 PID: 64 Comm: kworker/2:1 Kdump: loaded Tainted: G           O      4.19.0Lyude-Test+ #2
[  160.699020] Hardware name: HP HP ZBook 15 G4/8275, BIOS P70 Ver. 01.22 05/17/2018
[  160.699672] Workqueue: events_long drm_dp_mst_link_probe_work [drm_kms_helper]
[  160.700322] RIP: 0010:          (null)
[  160.700920] Code: Bad RIP value.
[  160.701541] RSP: 0018:ffffc9000029fc78 EFLAGS: 00010206
[  160.702183] RAX: 0000000000000000 RBX: ffff8804440ed468 RCX: ffff8804440e9158
[  160.702778] RDX: 0000000000000000 RSI: ffff8804556c5700 RDI: ffff8804440ed000
[  160.703408] RBP: ffff880458e21800 R08: 0000000000000002 R09: 000000005fca0a25
[  160.704002] R10: ffff88045a077a3d R11: ffff88045a077a3c R12: ffff8804440ed000
[  160.704614] R13: ffff880458e21800 R14: ffff8804440e9000 R15: ffff8804440e9000
[  160.705260] FS:  0000000000000000(0000) GS:ffff88045f280000(0000) knlGS:0000000000000000
[  160.705854] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  160.706478] CR2: ffffffffffffffd6 CR3: 000000000200a001 CR4: 00000000003606e0
[  160.707124] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  160.707724] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  160.708372] Call Trace:
[  160.708998]  ? dm_dp_add_mst_connector+0xed/0x1d0 [amdgpu]
[  160.709625]  ? drm_dp_add_port+0x2fa/0x470 [drm_kms_helper]
[  160.710284]  ? wake_up_q+0x54/0x70
[  160.710877]  ? __mutex_unlock_slowpath.isra.18+0xb3/0x110
[  160.711512]  ? drm_dp_dpcd_access+0xe7/0x110 [drm_kms_helper]
[  160.712161]  ? drm_dp_send_link_address+0x155/0x1e0 [drm_kms_helper]
[  160.712762]  ? drm_dp_check_and_send_link_address+0xa3/0xd0 [drm_kms_helper]
[  160.713408]  ? drm_dp_mst_link_probe_work+0x4b/0x80 [drm_kms_helper]
[  160.714013]  ? process_one_work+0x1a1/0x3a0
[  160.714667]  ? worker_thread+0x30/0x380
[  160.715326]  ? wq_update_unbound_numa+0x10/0x10
[  160.715939]  ? kthread+0x112/0x130
[  160.716591]  ? kthread_create_worker_on_cpu+0x70/0x70
[  160.717262]  ? ret_from_fork+0x35/0x40
[  160.717886] Modules linked in: amdgpu(O) vfat fat snd_hda_codec_generic joydev i915 chash gpu_sched ttm i2c_algo_bit drm_kms_helper snd_hda_codec_hdmi hp_wmi syscopyarea iTCO_wdt sysfillrect sparse_keymap sysimgblt fb_sys_fops snd_hda_intel usbhid wmi_bmof drm snd_hda_codec btusb snd_hda_core intel_rapl btrtl x86_pkg_temp_thermal btbcm btintel coretemp snd_pcm crc32_pclmul bluetooth psmouse snd_timer snd pcspkr i2c_i801 mei_me i2c_core soundcore mei tpm_tis wmi tpm_tis_core hp_accel ecdh_generic lis3lv02d tpm video rfkill acpi_pad input_polldev hp_wireless pcc_cpufreq crc32c_intel serio_raw tg3 xhci_pci xhci_hcd [last unloaded: amdgpu]
[  160.720141] CR2: 0000000000000000

Somehow the connector reusage DM was using for MST connectors managed to
paper over this issue entirely; hence why this was never caught until
now.

[how]
Since this code isn't used anywhere and seems useless anyway, we can
just drop it entirely. This appears to fix the issue on my HP ZBook with
an AMD WX4150.

Signed-off-by: Lyude Paul <lyude@redhat.com>
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
johalun pushed a commit that referenced this pull request Dec 8, 2018
Turns out that if you trigger an HPD storm on a system that has an MST
topology connected to it, you'll end up causing the kernel to eventually
hit a NULL deref:

[  332.339041] BUG: unable to handle kernel NULL pointer dereference at 00000000000000ec
[  332.340906] PGD 0 P4D 0
[  332.342750] Oops: 0000 [#1] SMP PTI
[  332.344579] CPU: 2 PID: 25 Comm: kworker/2:0 Kdump: loaded Tainted: G           O      4.18.0-rc3short-hpd-storm+ #2
[  332.346453] Hardware name: LENOVO 20BWS1KY00/20BWS1KY00, BIOS JBET71WW (1.35 ) 09/14/2018
[  332.348361] Workqueue: events intel_hpd_irq_storm_reenable_work [i915]
[  332.350301] RIP: 0010:intel_hpd_irq_storm_reenable_work.cold.3+0x2f/0x86 [i915]
[  332.352213] Code: 00 00 ba e8 00 00 00 48 c7 c6 c0 aa 5f a0 48 c7 c7 d0 73 62 a0 4c 89 c1 4c 89 04 24 e8 7f f5 af e0 4c 8b 04 24 44 89 f8 29 e8 <41> 39 80 ec 00 00 00 0f 85 43 13 fc ff 41 0f b6 86 b8 04 00 00 41
[  332.354286] RSP: 0018:ffffc90000147e48 EFLAGS: 00010006
[  332.356344] RAX: 0000000000000005 RBX: ffff8802c226c9d4 RCX: 0000000000000006
[  332.358404] RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff88032dc95570
[  332.360466] RBP: 0000000000000005 R08: 0000000000000000 R09: ffff88031b3dc840
[  332.362528] R10: 0000000000000000 R11: 000000031a069602 R12: ffff8802c226ca20
[  332.364575] R13: ffff8802c2268000 R14: ffff880310661000 R15: 000000000000000a
[  332.366615] FS:  0000000000000000(0000) GS:ffff88032dc80000(0000) knlGS:0000000000000000
[  332.368658] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  332.370690] CR2: 00000000000000ec CR3: 000000000200a003 CR4: 00000000003606e0
[  332.372724] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  332.374773] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  332.376798] Call Trace:
[  332.378809]  process_one_work+0x1a1/0x350
[  332.380806]  worker_thread+0x30/0x380
[  332.382777]  ? wq_update_unbound_numa+0x10/0x10
[  332.384772]  kthread+0x112/0x130
[  332.386740]  ? kthread_create_worker_on_cpu+0x70/0x70
[  332.388706]  ret_from_fork+0x35/0x40
[  332.390651] Modules linked in: i915(O) vfat fat joydev btusb btrtl btbcm btintel bluetooth ecdh_generic iTCO_wdt wmi_bmof i2c_algo_bit drm_kms_helper intel_rapl syscopyarea sysfillrect x86_pkg_temp_thermal sysimgblt coretemp fb_sys_fops crc32_pclmul drm psmouse pcspkr mei_me mei i2c_i801 lpc_ich mfd_core i2c_core tpm_tis tpm_tis_core thinkpad_acpi wmi tpm rfkill video crc32c_intel serio_raw ehci_pci xhci_pci ehci_hcd xhci_hcd [last unloaded: i915]
[  332.394963] CR2: 00000000000000ec

This appears to be due to the fact that with an MST topology, not all
intel_connector structs will have ->encoder set. So, fix this by
skipping connectors without encoders in
intel_hpd_irq_storm_reenable_work().

For those wondering, this bug was found on accident while simulating HPD
storms using a Chamelium connected to a ThinkPad T450s (Broadwell).

Changes since v1:
- Check intel_connector->mst_port instead of intel_connector->encoder

Signed-off-by: Lyude Paul <lyude@redhat.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: stable@vger.kernel.org
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20181106213017.14563-3-lyude@redhat.com
(cherry picked from commit fee61deecb1d850bf34f682a6a452e5ee51b7572)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
valpackett pushed a commit to valpackett/kms-drm that referenced this pull request Apr 15, 2020
[ Upstream commit dcd5fb82ffb484124203aa339733663ac0b059f3 ]

Reference counting in amdgpu_dm_connector for amdgpu_dm_connector::dc_sink
and amdgpu_dm_connector::dc_em_sink as well as in dc_link::local_sink seems
to be out of shape. Thus make reference counting consistent for these
members and just plain increment the reference count when the variable
gets assigned and decrement when the pointer is set to zero or replaced.
Also simplify reference counting in selected function sopes to be sure the
reference is released in any case. In some cases add NULL pointer check
before dereferencing.
At a hand full of places a comment is placed to stat that the reference
increment happened already somewhere else.

This actually fixes the following kernel bug on my system when enabling
display core in amdgpu. There are some more similar bug reports around,
so it probably helps at more places.

   kernel BUG at mm/slub.c:294!
   invalid opcode: 0000 [FreeBSDDesktop#1] SMP PTI
   CPU: 9 PID: 1180 Comm: Xorg Not tainted 5.0.0-rc1+ FreeBSDDesktop#2
   Hardware name: Supermicro X10DAi/X10DAI, BIOS 3.0a 02/05/2018
   RIP: 0010:__slab_free+0x1e2/0x3d0
   Code: 8b 54 24 30 48 89 4c 24 28 e8 da fb ff ff 4c 8b 54 24 28 85 c0 0f 85 67 fe ff ff 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 49 3b 5c 24 28 75 ab 48 8b 44 24 30 49 89 4c 24 28 49 89 44
   RSP: 0018:ffffb0978589fa90 EFLAGS: 00010246
   RAX: ffff92f12806c400 RBX: 0000000080200019 RCX: ffff92f12806c400
   RDX: ffff92f12806c400 RSI: ffffdd6421a01a00 RDI: ffff92ed2f406e80
   RBP: ffffb0978589fb40 R08: 0000000000000001 R09: ffffffffc0ee4748
   R10: ffff92f12806c400 R11: 0000000000000001 R12: ffffdd6421a01a00
   R13: ffff92f12806c400 R14: ffff92ed2f406e80 R15: ffffdd6421a01a20
   FS:  00007f4170be0ac0(0000) GS:ffff92ed2fb40000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 0000562818aaa000 CR3: 000000045745a002 CR4: 00000000003606e0
   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
   DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
   Call Trace:
    ? drm_dbg+0x87/0x90 [drm]
    dc_stream_release+0x28/0x50 [amdgpu]
    amdgpu_dm_connector_mode_valid+0xb4/0x1f0 [amdgpu]
    drm_helper_probe_single_connector_modes+0x492/0x6b0 [drm_kms_helper]
    drm_mode_getconnector+0x457/0x490 [drm]
    ? drm_connector_property_set_ioctl+0x60/0x60 [drm]
    drm_ioctl_kernel+0xa9/0xf0 [drm]
    drm_ioctl+0x201/0x3a0 [drm]
    ? drm_connector_property_set_ioctl+0x60/0x60 [drm]
    amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
    do_vfs_ioctl+0xa4/0x630
    ? __sys_recvmsg+0x83/0xa0
    ksys_ioctl+0x60/0x90
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x5b/0x160
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
   RIP: 0033:0x7f417110809b
   Code: 0f 1e fa 48 8b 05 ed bd 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d bd bd 0c 00 f7 d8 64 89 01 48
   RSP: 002b:00007ffdd8d1c268 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
   RAX: ffffffffffffffda RBX: 0000562818a8ebc0 RCX: 00007f417110809b
   RDX: 00007ffdd8d1c2a0 RSI: 00000000c05064a7 RDI: 0000000000000012
   RBP: 00007ffdd8d1c2a0 R08: 0000562819012280 R09: 0000000000000007
   R10: 0000000000000000 R11: 0000000000000246 R12: 00000000c05064a7
   R13: 0000000000000012 R14: 0000000000000012 R15: 00007ffdd8d1c2a0
   Modules linked in: nfsv4 dns_resolver nfs lockd grace fscache fuse vfat fat amdgpu intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul chash gpu_sched crc32_pclmul snd_hda_codec_realtek ghash_clmulni_intel amd_iommu_v2 iTCO_wdt iTCO_vendor_support ttm snd_hda_codec_generic snd_hda_codec_hdmi ledtrig_audio snd_hda_intel drm_kms_helper snd_hda_codec intel_cstate snd_hda_core drm snd_hwdep snd_seq snd_seq_device intel_uncore snd_pcm intel_rapl_perf snd_timer snd soundcore ioatdma pcspkr intel_wmi_thunderbolt mxm_wmi i2c_i801 lpc_ich pcc_cpufreq auth_rpcgss sunrpc igb crc32c_intel i2c_algo_bit dca wmi hid_cherry analog gameport joydev

This patch is based on agd5f/drm-next-5.1-wip. This patch does not require
all of that, but agd5f/drm-next-5.1-wip contains at least one more dc_sink
counting fix that I could spot.

Signed-off-by: Mathias Fröhlich <Mathias.Froehlich@web.de>
Reviewed-by: Leo Li <sunpeng.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
amshafer referenced this pull request in amshafer/kms-drm Apr 26, 2020
Reference counting in amdgpu_dm_connector for amdgpu_dm_connector::dc_sink
and amdgpu_dm_connector::dc_em_sink as well as in dc_link::local_sink seems
to be out of shape. Thus make reference counting consistent for these
members and just plain increment the reference count when the variable
gets assigned and decrement when the pointer is set to zero or replaced.
Also simplify reference counting in selected function sopes to be sure the
reference is released in any case. In some cases add NULL pointer check
before dereferencing.
At a hand full of places a comment is placed to stat that the reference
increment happened already somewhere else.

This actually fixes the following kernel bug on my system when enabling
display core in amdgpu. There are some more similar bug reports around,
so it probably helps at more places.

   kernel BUG at mm/slub.c:294!
   invalid opcode: 0000 [#1] SMP PTI
   CPU: 9 PID: 1180 Comm: Xorg Not tainted 5.0.0-rc1+ #2
   Hardware name: Supermicro X10DAi/X10DAI, BIOS 3.0a 02/05/2018
   RIP: 0010:__slab_free+0x1e2/0x3d0
   Code: 8b 54 24 30 48 89 4c 24 28 e8 da fb ff ff 4c 8b 54 24 28 85 c0 0f 85 67 fe ff ff 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 49 3b 5c 24 28 75 ab 48 8b 44 24 30 49 89 4c 24 28 49 89 44
   RSP: 0018:ffffb0978589fa90 EFLAGS: 00010246
   RAX: ffff92f12806c400 RBX: 0000000080200019 RCX: ffff92f12806c400
   RDX: ffff92f12806c400 RSI: ffffdd6421a01a00 RDI: ffff92ed2f406e80
   RBP: ffffb0978589fb40 R08: 0000000000000001 R09: ffffffffc0ee4748
   R10: ffff92f12806c400 R11: 0000000000000001 R12: ffffdd6421a01a00
   R13: ffff92f12806c400 R14: ffff92ed2f406e80 R15: ffffdd6421a01a20
   FS:  00007f4170be0ac0(0000) GS:ffff92ed2fb40000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 0000562818aaa000 CR3: 000000045745a002 CR4: 00000000003606e0
   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
   DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
   Call Trace:
    ? drm_dbg+0x87/0x90 [drm]
    dc_stream_release+0x28/0x50 [amdgpu]
    amdgpu_dm_connector_mode_valid+0xb4/0x1f0 [amdgpu]
    drm_helper_probe_single_connector_modes+0x492/0x6b0 [drm_kms_helper]
    drm_mode_getconnector+0x457/0x490 [drm]
    ? drm_connector_property_set_ioctl+0x60/0x60 [drm]
    drm_ioctl_kernel+0xa9/0xf0 [drm]
    drm_ioctl+0x201/0x3a0 [drm]
    ? drm_connector_property_set_ioctl+0x60/0x60 [drm]
    amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
    do_vfs_ioctl+0xa4/0x630
    ? __sys_recvmsg+0x83/0xa0
    ksys_ioctl+0x60/0x90
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x5b/0x160
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
   RIP: 0033:0x7f417110809b
   Code: 0f 1e fa 48 8b 05 ed bd 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d bd bd 0c 00 f7 d8 64 89 01 48
   RSP: 002b:00007ffdd8d1c268 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
   RAX: ffffffffffffffda RBX: 0000562818a8ebc0 RCX: 00007f417110809b
   RDX: 00007ffdd8d1c2a0 RSI: 00000000c05064a7 RDI: 0000000000000012
   RBP: 00007ffdd8d1c2a0 R08: 0000562819012280 R09: 0000000000000007
   R10: 0000000000000000 R11: 0000000000000246 R12: 00000000c05064a7
   R13: 0000000000000012 R14: 0000000000000012 R15: 00007ffdd8d1c2a0
   Modules linked in: nfsv4 dns_resolver nfs lockd grace fscache fuse vfat fat amdgpu intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul chash gpu_sched crc32_pclmul snd_hda_codec_realtek ghash_clmulni_intel amd_iommu_v2 iTCO_wdt iTCO_vendor_support ttm snd_hda_codec_generic snd_hda_codec_hdmi ledtrig_audio snd_hda_intel drm_kms_helper snd_hda_codec intel_cstate snd_hda_core drm snd_hwdep snd_seq snd_seq_device intel_uncore snd_pcm intel_rapl_perf snd_timer snd soundcore ioatdma pcspkr intel_wmi_thunderbolt mxm_wmi i2c_i801 lpc_ich pcc_cpufreq auth_rpcgss sunrpc igb crc32c_intel i2c_algo_bit dca wmi hid_cherry analog gameport joydev

This patch is based on agd5f/drm-next-5.1-wip. This patch does not require
all of that, but agd5f/drm-next-5.1-wip contains at least one more dc_sink
counting fix that I could spot.

Signed-off-by: Mathias Fröhlich <Mathias.Froehlich@web.de>
Reviewed-by: Leo Li <sunpeng.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants