Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OMAP3+: SmartReflex: disable errgen before vpbound disable #4

Closed
wants to merge 1 commit into from

Conversation

Docker-J
Copy link

@Docker-J Docker-J commented Feb 8, 2013

@Quarx2k
Copy link
Owner

Quarx2k commented Feb 8, 2013

Thanks, i merged it manually:)

@Quarx2k Quarx2k closed this Feb 8, 2013
Blechd0se pushed a commit that referenced this pull request Dec 3, 2013
If the netdev is already in NETREG_UNREGISTERING/_UNREGISTERED state, do not
update the real num tx queues. netdev_queue_update_kobjects() is already
called via remove_queue_kobjects() at NETREG_UNREGISTERING time. So, when
upper layer driver, e.g., FCoE protocol stack is monitoring the netdev
event of NETDEV_UNREGISTER and calls back to LLD ndo_fcoe_disable() to remove
extra queues allocated for FCoE, the associated txq sysfs kobjects are already
removed, and trying to update the real num queues would cause something like
below:

...
PID: 25138  TASK: ffff88021e64c440  CPU: 3   COMMAND: "kworker/3:3"
 #0 [ffff88021f007760] machine_kexec at ffffffff810226d9
 #1 [ffff88021f0077d0] crash_kexec at ffffffff81089d2d
 #2 [ffff88021f0078a0] oops_end at ffffffff813bca78
 #3 [ffff88021f0078d0] no_context at ffffffff81029e72
 #4 [ffff88021f007920] __bad_area_nosemaphore at ffffffff8102a155
 #5 [ffff88021f0079f0] bad_area_nosemaphore at ffffffff8102a23e
 #6 [ffff88021f007a00] do_page_fault at ffffffff813bf32e
 #7 [ffff88021f007b10] page_fault at ffffffff813bc045
    [exception RIP: sysfs_find_dirent+17]
    RIP: ffffffff81178611  RSP: ffff88021f007bc0  RFLAGS: 00010246
    RAX: ffff88021e64c440  RBX: ffffffff8156cc63  RCX: 0000000000000004
    RDX: ffffffff8156cc63  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: ffff88021f007be0   R8: 0000000000000004   R9: 0000000000000008
    R10: ffffffff816fed00  R11: 0000000000000004  R12: 0000000000000000
    R13: ffffffff8156cc63  R14: 0000000000000000  R15: ffff8802222a0000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffff88021f007be8] sysfs_get_dirent at ffffffff81178c07
 #9 [ffff88021f007c18] sysfs_remove_group at ffffffff8117ac27
#10 [ffff88021f007c48] netdev_queue_update_kobjects at ffffffff813178f9
#11 [ffff88021f007c88] netif_set_real_num_tx_queues at ffffffff81303e38
#12 [ffff88021f007cc8] ixgbe_set_num_queues at ffffffffa0249763 [ixgbe]
#13 [ffff88021f007cf8] ixgbe_init_interrupt_scheme at ffffffffa024ea89 [ixgbe]
#14 [ffff88021f007d48] ixgbe_fcoe_disable at ffffffffa0267113 [ixgbe]
#15 [ffff88021f007d68] vlan_dev_fcoe_disable at ffffffffa014fef5 [8021q]
#16 [ffff88021f007d78] fcoe_interface_cleanup at ffffffffa02b7dfd [fcoe]
#17 [ffff88021f007df8] fcoe_destroy_work at ffffffffa02b7f08 [fcoe]
#18 [ffff88021f007e18] process_one_work at ffffffff8105d7ca
#19 [ffff88021f007e68] worker_thread at ffffffff81060513
#20 [ffff88021f007ee8] kthread at ffffffff810648b6
#21 [ffff88021f007f48] kernel_thread_helper at ffffffff813c40f4

Signed-off-by: Yi Zou <yi.zou@intel.com>
Tested-by: Ross Brattain <ross.b.brattain@intel.com>
Tested-by: Stephen Ko <stephen.s.ko@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Blechd0se pushed a commit that referenced this pull request Dec 3, 2013
Printing the "start_ip" for every secondary cpu is very noisy on a large
system - and doesn't add any value. Drop this message.

Console log before:
Booting Node   0, Processors  #1
smpboot cpu 1: start_ip = 96000
 #2
smpboot cpu 2: start_ip = 96000
 #3
smpboot cpu 3: start_ip = 96000
 #4
smpboot cpu 4: start_ip = 96000
       ...
 #31
smpboot cpu 31: start_ip = 96000
Brought up 32 CPUs

Console log after:
Booting Node   0, Processors  #1 #2 #3 #4 #5 #6 #7 Ok.
Booting Node   1, Processors  #8 #9 #10 #11 #12 #13 #14 #15 Ok.
Booting Node   0, Processors  #16 #17 #18 #19 #20 #21 #22 #23 Ok.
Booting Node   1, Processors  #24 #25 #26 #27 #28 #29 #30 #31
Brought up 32 CPUs

Acked-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4f452eb42507460426@agluck-desktop.sc.intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Blechd0se pushed a commit that referenced this pull request Dec 3, 2013
Remove individual platform header files for dm365, dm355, dm644x
and dm646x and consolidate it into a single and common
header file davinci.h placed in arch/arm/mach-davinci.

This reduces the pollution in the include/mach and is consistent
with Russell's suggestions as part of his "pet peaves" mail.
(See #4 in: http://lists.infradead.org/pipermail/linux-arm-kernel/2011-November/071516.html)

While at it, fix the forward declaration of spi_board_info,
and include the right header file instead.

The further patches in the series take  advantage of this consolidation
for easy implementation of IO_ADDRESS elimination.

Signed-off-by: Manjunath Hadli <manjunath.hadli@ti.com>
[nsekhar@ti.com: make davinci.h the first local include file,
fix forward declaration of spi_board_info and add back Deep Root
Systems, LLC copyright]
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
Blechd0se pushed a commit that referenced this pull request Dec 3, 2013
The <linux/device.h> header includes a lot of stuff, and
it in turn gets a lot of use just for the basic "struct device"
which appears so often.

Clean up the users as follows:

1) For those headers only needing "struct device" as a pointer
in fcn args, replace the include with exactly that.

2) For headers not really using anything from device.h, simply
delete the include altogether.

3) For headers relying on getting device.h implicitly before
being included themselves, now explicitly include device.h

4) For files in which doing #1 or #2 uncovers an implicit
dependency on some other header, fix by explicitly adding
the required header(s).

Any C files that were implicitly relying on device.h to be
present have already been dealt with in advance.

Total removals from #1 and #2: 51.  Total additions coming
from #3: 9.  Total other implicit dependencies from #4: 7.

As of 3.3-rc1, there were 110, so a net removal of 42 gives
about a 38% reduction in device.h presence in include/*

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Blechd0se pushed a commit that referenced this pull request Dec 3, 2013
The compiler does not conditionalize the assembly instructions for
the tlb operations, which leads to sub-optimal code being generated
when building a kernel for multiple CPUs.

We can tweak things fairly simply as the code fragment below shows:

    17f8:       e3120001        tst     r2, #1  ; 0x1
...
    1800:       0a000000        beq     1808 <handle_pte_fault+0x194>
    1804:       ee061f10        mcr     15, 0, r1, cr6, cr0, {0}
    1808:       e3120004        tst     r2, #4  ; 0x4
    180c:       0a000000        beq     1814 <handle_pte_fault+0x1a0>
    1810:       ee081f36        mcr     15, 0, r1, cr8, cr6, {1}
becomes:
    17f0:       e3120001        tst     r2, #1  ; 0x1
    17f4:       1e063f10        mcrne   15, 0, r3, cr6, cr0, {0}
    17f8:       e3120004        tst     r2, #4  ; 0x4
    17fc:       1e083f36        mcrne   15, 0, r3, cr8, cr6, {1}

Overall, for Realview with V6 and V7 CPUs configured:

   text    data     bss     dec     hex filename
4153998  207340 5371036 9732374  948116 ../build/realview/vmlinux.before
4153366  207332 5371036 9731734  947e96 ../build/realview/vmlinux.after

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Blechd0se pushed a commit that referenced this pull request Dec 3, 2013
The ID used for detection of the BenQ R55E actually identifies the
Quanta TW3 ODM design, which is also used for the Gigabyte W551 laptop
series. Schematics on the internet clearly indicate that the "Port C"
(analog input connected to record source #4 and mixer input #4) is
unconnected.

Playing an audio CD through analog playback (using cdplay from cdtools)
produces no sound, even with the mixer input labelled "CD" enabled, and
the volume control in the CD drive set to maximum. This indicates the
connection is really not present.

Signed-off-by: Michael Karcher <kernel@mkarcher.dialup.fu-berlin.de>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Blechd0se pushed a commit that referenced this pull request Dec 3, 2013
commit 2c0c2a0 upstream.

While traversing the linked list of open file handles, if the identfied
file handle is invalid, a reopen is attempted and if it fails, we
resume traversing where we stopped and cifs can oops while accessing
invalid next element, for list might have changed.

So mark the invalid file handle and attempt reopen if no
valid file handle is found in rest of the list.
If reopen fails, move the invalid file handle to the end of the list
and start traversing the list again from the begining.
Repeat this four times before giving up and returning an error if
file reopen keeps failing.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Blechd0se pushed a commit that referenced this pull request Dec 3, 2013
commit e5851da upstream.

Remove spinlock as atomic_t can be used instead. Note we use only 16
lower bits, upper bits are changed but we impilcilty cast to u16.

This fix possible deadlock on IBSS mode reproted by lockdep:

=================================
[ INFO: inconsistent lock state ]
3.4.0-wl+ #4 Not tainted
---------------------------------
inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
kworker/u:2/30374 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (&(&intf->seqlock)->rlock){+.?...}, at: [<f9979a20>] rt2x00queue_create_tx_descriptor+0x380/0x490 [rt2x00lib]
{IN-SOFTIRQ-W} state was registered at:
  [<c04978ab>] __lock_acquire+0x47b/0x1050
  [<c0498504>] lock_acquire+0x84/0xf0
  [<c0835733>] _raw_spin_lock+0x33/0x40
  [<f9979a20>] rt2x00queue_create_tx_descriptor+0x380/0x490 [rt2x00lib]
  [<f9979f2a>] rt2x00queue_write_tx_frame+0x1a/0x300 [rt2x00lib]
  [<f997834f>] rt2x00mac_tx+0x7f/0x380 [rt2x00lib]
  [<f98fe363>] __ieee80211_tx+0x1b3/0x300 [mac80211]
  [<f98ffdf5>] ieee80211_tx+0x105/0x130 [mac80211]
  [<f99000dd>] ieee80211_xmit+0xad/0x100 [mac80211]
  [<f9900519>] ieee80211_subif_start_xmit+0x2d9/0x930 [mac80211]
  [<c0782e87>] dev_hard_start_xmit+0x307/0x660
  [<c079bb71>] sch_direct_xmit+0xa1/0x1e0
  [<c0784bb3>] dev_queue_xmit+0x183/0x730
  [<c078c27a>] neigh_resolve_output+0xfa/0x1e0
  [<c07b436a>] ip_finish_output+0x24a/0x460
  [<c07b4897>] ip_output+0xb7/0x100
  [<c07b2d60>] ip_local_out+0x20/0x60
  [<c07e01ff>] igmpv3_sendpack+0x4f/0x60
  [<c07e108f>] igmp_ifc_timer_expire+0x29f/0x330
  [<c04520fc>] run_timer_softirq+0x15c/0x2f0
  [<c0449e3e>] __do_softirq+0xae/0x1e0
irq event stamp: 18380437
hardirqs last  enabled at (18380437): [<c0526027>] __slab_alloc.clone.3+0x67/0x5f0
hardirqs last disabled at (18380436): [<c0525ff3>] __slab_alloc.clone.3+0x33/0x5f0
softirqs last  enabled at (18377616): [<c0449eb3>] __do_softirq+0x123/0x1e0
softirqs last disabled at (18377611): [<c041278d>] do_softirq+0x9d/0xe0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&(&intf->seqlock)->rlock);
  <Interrupt>
    lock(&(&intf->seqlock)->rlock);

 *** DEADLOCK ***

4 locks held by kworker/u:2/30374:
 #0:  (wiphy_name(local->hw.wiphy)){++++.+}, at: [<c045cf99>] process_one_work+0x109/0x3f0
 #1:  ((&sdata->work)){+.+.+.}, at: [<c045cf99>] process_one_work+0x109/0x3f0
 #2:  (&ifibss->mtx){+.+.+.}, at: [<f98f005b>] ieee80211_ibss_work+0x1b/0x470 [mac80211]
 #3:  (&intf->beacon_skb_mutex){+.+...}, at: [<f997a644>] rt2x00queue_update_beacon+0x24/0x50 [rt2x00lib]

stack backtrace:
Pid: 30374, comm: kworker/u:2 Not tainted 3.4.0-wl+ #4
Call Trace:
 [<c04962a6>] print_usage_bug+0x1f6/0x220
 [<c0496a12>] mark_lock+0x2c2/0x300
 [<c0495ff0>] ? check_usage_forwards+0xc0/0xc0
 [<c04978ec>] __lock_acquire+0x4bc/0x1050
 [<c0527890>] ? __kmalloc_track_caller+0x1c0/0x1d0
 [<c0777fb6>] ? copy_skb_header+0x26/0x90
 [<c0498504>] lock_acquire+0x84/0xf0
 [<f9979a20>] ? rt2x00queue_create_tx_descriptor+0x380/0x490 [rt2x00lib]
 [<c0835733>] _raw_spin_lock+0x33/0x40
 [<f9979a20>] ? rt2x00queue_create_tx_descriptor+0x380/0x490 [rt2x00lib]
 [<f9979a20>] rt2x00queue_create_tx_descriptor+0x380/0x490 [rt2x00lib]
 [<f997a5cf>] rt2x00queue_update_beacon_locked+0x5f/0xb0 [rt2x00lib]
 [<f997a64d>] rt2x00queue_update_beacon+0x2d/0x50 [rt2x00lib]
 [<f9977e3a>] rt2x00mac_bss_info_changed+0x1ca/0x200 [rt2x00lib]
 [<f9977c70>] ? rt2x00mac_remove_interface+0x70/0x70 [rt2x00lib]
 [<f98e4dd0>] ieee80211_bss_info_change_notify+0xe0/0x1d0 [mac80211]
 [<f98ef7b8>] __ieee80211_sta_join_ibss+0x3b8/0x610 [mac80211]
 [<c0496ab4>] ? mark_held_locks+0x64/0xc0
 [<c0440012>] ? virt_efi_query_capsule_caps+0x12/0x50
 [<f98efb09>] ieee80211_sta_join_ibss+0xf9/0x140 [mac80211]
 [<f98f0456>] ieee80211_ibss_work+0x416/0x470 [mac80211]
 [<c0496d8b>] ? trace_hardirqs_on+0xb/0x10
 [<c077683b>] ? skb_dequeue+0x4b/0x70
 [<f98f207f>] ieee80211_iface_work+0x13f/0x230 [mac80211]
 [<c045cf99>] ? process_one_work+0x109/0x3f0
 [<c045d015>] process_one_work+0x185/0x3f0
 [<c045cf99>] ? process_one_work+0x109/0x3f0
 [<f98f1f40>] ? ieee80211_teardown_sdata+0xa0/0xa0 [mac80211]
 [<c045ed86>] worker_thread+0x116/0x270
 [<c045ec70>] ? manage_workers+0x1e0/0x1e0
 [<c0462f64>] kthread+0x84/0x90
 [<c0462ee0>] ? __init_kthread_worker+0x60/0x60
 [<c083d382>] kernel_thread_helper+0x6/0x10

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Acked-by: Helmut Schaa <helmut.schaa@googlemail.com>
Acked-by: Gertjan van Wingerde <gwingerde@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Blechd0se pushed a commit that referenced this pull request Dec 3, 2013
…condition

commit 26c1917 upstream.

When holding the mmap_sem for reading, pmd_offset_map_lock should only
run on a pmd_t that has been read atomically from the pmdp pointer,
otherwise we may read only half of it leading to this crash.

PID: 11679  TASK: f06e8000  CPU: 3   COMMAND: "do_race_2_panic"
 #0 [f06a9dd8] crash_kexec at c049b5ec
 #1 [f06a9e2c] oops_end at c083d1c2
 #2 [f06a9e40] no_context at c0433ded
 #3 [f06a9e64] bad_area_nosemaphore at c043401a
 #4 [f06a9e6c] __do_page_fault at c0434493
 #5 [f06a9eec] do_page_fault at c083eb45
 #6 [f06a9f04] error_code (via page_fault) at c083c5d5
    EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP:
    00000000
    DS:  007b     ESI: 9e201000 ES:  007b     EDI: 01fb4700 GS:  00e0
    CS:  0060     EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246
 #7 [f06a9f38] _spin_lock at c083bc14
 #8 [f06a9f44] sys_mincore at c0507b7d
 #9 [f06a9fb0] system_call at c083becd
                         start           len
    EAX: ffffffda  EBX: 9e200000  ECX: 00001000  EDX: 6228537f
    DS:  007b      ESI: 00000000  ES:  007b      EDI: 003d0f00
    SS:  007b      ESP: 62285354  EBP: 62285388  GS:  0033
    CS:  0073      EIP: 00291416  ERR: 000000da  EFLAGS: 00000286

This should be a longstanding bug affecting x86 32bit PAE without THP.
Only archs with 64bit large pmd_t and 32bit unsigned long should be
affected.

With THP enabled the barrier() in pmd_none_or_trans_huge_or_clear_bad()
would partly hide the bug when the pmd transition from none to stable,
by forcing a re-read of the *pmd in pmd_offset_map_lock, but when THP is
enabled a new set of problem arises by the fact could then transition
freely in any of the none, pmd_trans_huge or pmd_trans_stable states.
So making the barrier in pmd_none_or_trans_huge_or_clear_bad()
unconditional isn't good idea and it would be a flakey solution.

This should be fully fixed by introducing a pmd_read_atomic that reads
the pmd in order with THP disabled, or by reading the pmd atomically
with cmpxchg8b with THP enabled.

Luckily this new race condition only triggers in the places that must
already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix
is localized there but this bug is not related to THP.

NOTE: this can trigger on x86 32bit systems with PAE enabled with more
than 4G of ram, otherwise the high part of the pmd will never risk to be
truncated because it would be zero at all times, in turn so hiding the
SMP race.

This bug was discovered and fully debugged by Ulrich, quote:

----
[..]
pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and
eax.

    496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t
    *pmd)
    497 {
    498         /* depend on compiler for an atomic pmd read */
    499         pmd_t pmdval = *pmd;

                                // edi = pmd pointer
0xc0507a74 <sys_mincore+548>:   mov    0x8(%esp),%edi
...
                                // edx = PTE page table high address
0xc0507a84 <sys_mincore+564>:   mov    0x4(%edi),%edx
...
                                // eax = PTE page table low address
0xc0507a8e <sys_mincore+574>:   mov    (%edi),%eax

[..]

Please note that the PMD is not read atomically. These are two "mov"
instructions where the high order bits of the PMD entry are fetched
first. Hence, the above machine code is prone to the following race.

-  The PMD entry {high|low} is 0x0000000000000000.
   The "mov" at 0xc0507a84 loads 0x00000000 into edx.

-  A page fault (on another CPU) sneaks in between the two "mov"
   instructions and instantiates the PMD.

-  The PMD entry {high|low} is now 0x00000003fda38067.
   The "mov" at 0xc0507a8e loads 0xfda38067 into eax.
----

Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Blechd0se pushed a commit that referenced this pull request Dec 3, 2013
commit 3cf003c upstream.

[The async read code was broadened to include uncached reads in 3.5, so
the mainline patch did not apply directly. This patch is just a backport
to account for that change.]

Jian found that when he ran fsx on a 32 bit arch with a large wsize the
process and one of the bdi writeback kthreads would sometimes deadlock
with a stack trace like this:

crash> bt
PID: 2789   TASK: f02edaa0  CPU: 3   COMMAND: "fsx"
 #0 [eed63cbc] schedule at c083c5b3
 #1 [eed63d80] kmap_high at c0500ec8
 #2 [eed63db0] cifs_async_writev at f7fabcd7 [cifs]
 #3 [eed63df0] cifs_writepages at f7fb7f5c [cifs]
 #4 [eed63e50] do_writepages at c04f3e32
 #5 [eed63e54] __filemap_fdatawrite_range at c04e152a
 #6 [eed63ea4] filemap_fdatawrite at c04e1b3e
 #7 [eed63eb4] cifs_file_aio_write at f7fa111a [cifs]
 #8 [eed63ecc] do_sync_write at c052d202
 #9 [eed63f74] vfs_write at c052d4ee
#10 [eed63f94] sys_write at c052df4c
#11 [eed63fb0] ia32_sysenter_target at c0409a98
    EAX: 00000004  EBX: 00000003  ECX: abd73b73  EDX: 012a65c6
    DS:  007b      ESI: 012a65c6  ES:  007b      EDI: 00000000
    SS:  007b      ESP: bf8db178  EBP: bf8db1f8  GS:  0033
    CS:  0073      EIP: 40000424  ERR: 00000004  EFLAGS: 00000246

Each task would kmap part of its address array before getting stuck, but
not enough to actually issue the write.

This patch fixes this by serializing the marshal_iov operations for
async reads and writes. The idea here is to ensure that cifs
aggressively tries to populate a request before attempting to fulfill
another one. As soon as all of the pages are kmapped for a request, then
we can unlock and allow another one to proceed.

There's no need to do this serialization on non-CONFIG_HIGHMEM arches
however, so optimize all of this out when CONFIG_HIGHMEM isn't set.

Reported-by: Jian Li <jiali@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Blechd0se pushed a commit that referenced this pull request Dec 3, 2013
…d reasons

commit 5cf02d0 upstream.

We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     #6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     #7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     #8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     #9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    #10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    #11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    #12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    #13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    #14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    #15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    #16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    #17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    #18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    #19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    #20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    #21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    #22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    #23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    #24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    #25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Blechd0se pushed a commit that referenced this pull request Dec 3, 2013
commit eb48c07 upstream.

Each page mapped in a process's address space must be correctly
accounted for in _mapcount.  Normally the rules for this are
straightforward but hugetlbfs page table sharing is different.  The page
table pages at the PMD level are reference counted while the mapcount
remains the same.

If this accounting is wrong, it causes bugs like this one reported by
Larry Woodman:

  kernel BUG at mm/filemap.c:135!
  invalid opcode: 0000 [#1] SMP
  CPU 22
  Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
  Pid: 18001, comm: mpitest Tainted: G        W    3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
  RIP: 0010:[<ffffffff8112cfed>]  [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170
  Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
  Call Trace:
    delete_from_page_cache+0x40/0x80
    truncate_hugepages+0x115/0x1f0
    hugetlbfs_evict_inode+0x18/0x30
    evict+0x9f/0x1b0
    iput_final+0xe3/0x1e0
    iput+0x3e/0x50
    d_kill+0xf8/0x110
    dput+0xe2/0x1b0
    __fput+0x162/0x240

During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc()
shared page tables with the check dst_pte == src_pte.  The logic is if
the PMD page is the same, they must be shared.  This assumes that the
sharing is between the parent and child.  However, if the sharing is
with a different process entirely then this check fails as in this
diagram:

  parent
    |
    ------------>pmd
                 src_pte----------> data page
                                        ^
  other--------->pmd--------------------|
                  ^
  child-----------|
                 dst_pte

For this situation to occur, it must be possible for Parent and Other to
have faulted and failed to share page tables with each other.  This is
possible due to the following style of race.

  PROC A                                          PROC B
  copy_hugetlb_page_range                         copy_hugetlb_page_range
    src_pte == huge_pte_offset                      src_pte == huge_pte_offset
    !src_pte so no sharing                          !src_pte so no sharing

  (time passes)

  hugetlb_fault                                   hugetlb_fault
    huge_pte_alloc                                  huge_pte_alloc
      huge_pmd_share                                 huge_pmd_share
        LOCK(i_mmap_mutex)
        find nothing, no sharing
        UNLOCK(i_mmap_mutex)
                                                      LOCK(i_mmap_mutex)
                                                      find nothing, no sharing
                                                      UNLOCK(i_mmap_mutex)
      pmd_alloc                                       pmd_alloc
      LOCK(instantiation_mutex)
      fault
      UNLOCK(instantiation_mutex)
                                                  LOCK(instantiation_mutex)
                                                  fault
                                                  UNLOCK(instantiation_mutex)

These two processes are not poing to the same data page but are not
sharing page tables because the opportunity was missed.  When either
process later forks, the src_pte == dst pte is potentially insufficient.
As the check falls through, the wrong PTE information is copied in
(harmless but wrong) and the mapcount is bumped for a page mapped by a
shared page table leading to the BUG_ON.

This patch addresses the issue by moving pmd_alloc into huge_pmd_share
which guarantees that the shared pud is populated in the same critical
section as pmd.  This also means that huge_pte_offset test in
huge_pmd_share is serialized correctly now which in turn means that the
success of the sharing will be higher as the racing tasks see the pud
and pmd populated together.

Race identified and changelog written mostly by Mel Gorman.

{akpm@linux-foundation.org: attempt to make the huge_pmd_share() comment comprehensible, clean up coding style]
Reported-by: Larry Woodman <lwoodman@redhat.com>
Tested-by: Larry Woodman <lwoodman@redhat.com>
Reviewed-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Ken Chen <kenchen@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Blechd0se pushed a commit that referenced this pull request Dec 3, 2013
commit bea6832 upstream.

On architectures where cputime_t is 64 bit type, is possible to trigger
divide by zero on do_div(temp, (__force u32) total) line, if total is a
non zero number but has lower 32 bit's zeroed. Removing casting is not
a good solution since some do_div() implementations do cast to u32
internally.

This problem can be triggered in practice on very long lived processes:

  PID: 2331   TASK: ffff880472814b00  CPU: 2   COMMAND: "oraagent.bin"
   #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
   #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
   #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
   #3 [ffff880472a51cd0] die at ffffffff8100f26b
   #4 [ffff880472a51d00] do_trap at ffffffff814f03f4
   #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
   #6 [ffff880472a51e00] divide_error at ffffffff8100be7b
      [exception RIP: thread_group_times+0x56]
      RIP: ffffffff81056a16  RSP: ffff880472a51eb8  RFLAGS: 00010046
      RAX: bc3572c9fe12d194  RBX: ffff880874150800  RCX: 0000000110266fad
      RDX: 0000000000000000  RSI: ffff880472a51eb8  RDI: 001038ae7d9633dc
      RBP: ffff880472a51ef8   R8: 00000000b10a3a64   R9: ffff880874150800
      R10: 00007fcba27ab680  R11: 0000000000000202  R12: ffff880472a51f08
      R13: ffff880472a51f10  R14: 0000000000000000  R15: 0000000000000007
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
   #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
   #8 [ffff880472a51f40] sys_times at ffffffff81088524
   #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
      RIP: 0000003808caac3a  RSP: 00007fcba27ab6d8  RFLAGS: 00000202
      RAX: 0000000000000064  RBX: ffffffff8100b0f2  RCX: 0000000000000000
      RDX: 00007fcba27ab6e0  RSI: 000000000076d58e  RDI: 00007fcba27ab6e0
      RBP: 00007fcba27ab700   R8: 0000000000000020   R9: 000000000000091b
      R10: 00007fcba27ab680  R11: 0000000000000202  R12: 00007fff9ca41940
      R13: 0000000000000000  R14: 00007fcba27ac9c0  R15: 00007fff9ca41940
      ORIG_RAX: 0000000000000064  CS: 0033  SS: 002b

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120808092714.GA3580@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Blechd0se pushed a commit that referenced this pull request Dec 3, 2013
commit 61a0cfb upstream.

If SMP fails, we should always cancel security_timer delayed work.
Otherwise, security_timer function may run after l2cap_conn object
has been freed.

This patch fixes the following warning reported by ODEBUG:

WARNING: at lib/debugobjects.c:261 debug_print_object+0x7c/0x8d()
Hardware name: Bochs
ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x27
Modules linked in: btusb bluetooth
Pid: 440, comm: kworker/u:2 Not tainted 3.5.0-rc1+ #4
Call Trace:
 [<ffffffff81174600>] ? free_obj_work+0x4a/0x7f
 [<ffffffff81023eb8>] warn_slowpath_common+0x7e/0x97
 [<ffffffff81023f65>] warn_slowpath_fmt+0x41/0x43
 [<ffffffff811746b1>] debug_print_object+0x7c/0x8d
 [<ffffffff810394f0>] ? __queue_work+0x241/0x241
 [<ffffffff81174fdd>] debug_check_no_obj_freed+0x92/0x159
 [<ffffffff810ac08e>] slab_free_hook+0x6f/0x77
 [<ffffffffa0019145>] ? l2cap_conn_del+0x148/0x157 [bluetooth]
 [<ffffffff810ae408>] kfree+0x59/0xac
 [<ffffffffa0019145>] l2cap_conn_del+0x148/0x157 [bluetooth]
 [<ffffffffa001b9a2>] l2cap_recv_frame+0xa77/0xfa4 [bluetooth]
 [<ffffffff810592f9>] ? trace_hardirqs_on_caller+0x112/0x1ad
 [<ffffffffa001c86c>] l2cap_recv_acldata+0xe2/0x264 [bluetooth]
 [<ffffffffa0002b2f>] hci_rx_work+0x235/0x33c [bluetooth]
 [<ffffffff81038dc3>] ? process_one_work+0x126/0x2fe
 [<ffffffff81038e22>] process_one_work+0x185/0x2fe
 [<ffffffff81038dc3>] ? process_one_work+0x126/0x2fe
 [<ffffffff81059f2e>] ? lock_acquired+0x1b5/0x1cf
 [<ffffffffa00028fa>] ? le_scan_work+0x11d/0x11d [bluetooth]
 [<ffffffff81036fb6>] ? spin_lock_irq+0x9/0xb
 [<ffffffff81039209>] worker_thread+0xcf/0x175
 [<ffffffff8103913a>] ? rescuer_thread+0x175/0x175
 [<ffffffff8103cfe0>] kthread+0x95/0x9d
 [<ffffffff812c5054>] kernel_threadi_helper+0x4/0x10
 [<ffffffff812c36b0>] ? retint_restore_args+0x13/0x13
 [<ffffffff8103cf4b>] ? flush_kthread_worker+0xdb/0xdb
 [<ffffffff812c5050>] ? gs_change+0x13/0x13

This bug can be reproduced using hctool lecc or l2test tools and
bluetoothd not running.

Signed-off-by: Andre Guedes <andre.guedes@openbossa.org>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Blechd0se pushed a commit that referenced this pull request Dec 3, 2013
commit 1a7bbda upstream.

We actually do not do anything about it. Just return a default
value of zero and if the kernel tries to write anything but 0
we BUG_ON.

This fixes the case when an user tries to suspend the machine
and it blows up in save_processor_state b/c 'read_cr8' is set
to NULL and we get:

kernel BUG at /home/konrad/ssd/linux/arch/x86/include/asm/paravirt.h:100!
invalid opcode: 0000 [#1] SMP
Pid: 2687, comm: init.late Tainted: G           O 3.6.0upstream-00002-gac264ac-dirty #4 Bochs Bochs
RIP: e030:[<ffffffff814d5f42>]  [<ffffffff814d5f42>] save_processor_state+0x212/0x270

.. snip..
Call Trace:
 [<ffffffff810733bf>] do_suspend_lowlevel+0xf/0xac
 [<ffffffff8107330c>] ? x86_acpi_suspend_lowlevel+0x10c/0x150
 [<ffffffff81342ee2>] acpi_suspend_enter+0x57/0xd5

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Blechd0se pushed a commit that referenced this pull request Dec 3, 2013
commit 1022623 upstream.

In 32 bit the stack address provided by kernel_stack_pointer() may
point to an invalid range causing NULL pointer access or page faults
while in NMI (see trace below). This happens if called in softirq
context and if the stack is empty. The address at &regs->sp is then
out of range.

Fixing this by checking if regs and &regs->sp are in the same stack
context. Otherwise return the previous stack pointer stored in struct
thread_info. If that address is invalid too, return address of regs.

 BUG: unable to handle kernel NULL pointer dereference at 0000000a
 IP: [<c1004237>] print_context_stack+0x6e/0x8d
 *pde = 00000000
 Oops: 0000 [#1] SMP
 Modules linked in:
 Pid: 4434, comm: perl Not tainted 3.6.0-rc3-oprofile-i386-standard-g4411a05 #4 Hewlett-Packard HP xw9400 Workstation/0A1Ch
 EIP: 0060:[<c1004237>] EFLAGS: 00010093 CPU: 0
 EIP is at print_context_stack+0x6e/0x8d
 EAX: ffffe000 EBX: 0000000a ECX: f4435f94 EDX: 0000000a
 ESI: f4435f94 EDI: f4435f94 EBP: f5409ec0 ESP: f5409ea0
  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
 CR0: 8005003b CR2: 0000000a CR3: 34ac9000 CR4: 000007d0
 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
 DR6: ffff0ff0 DR7: 00000400
 Process perl (pid: 4434, ti=f5408000 task=f5637850 task.ti=f4434000)
 Stack:
  000003e8 ffffe000 00001ffc f4e39b00 00000000 0000000a f4435f94 c155198c
  f5409ef0 c1003723 c155198c f5409f04 00000000 f5409edc 00000000 00000000
  f5409ee8 f4435f94 f5409fc4 00000001 f5409f1c c12dce1c 00000000 c155198c
 Call Trace:
  [<c1003723>] dump_trace+0x7b/0xa1
  [<c12dce1c>] x86_backtrace+0x40/0x88
  [<c12db712>] ? oprofile_add_sample+0x56/0x84
  [<c12db731>] oprofile_add_sample+0x75/0x84
  [<c12ddb5b>] op_amd_check_ctrs+0x46/0x260
  [<c12dd40d>] profile_exceptions_notify+0x23/0x4c
  [<c1395034>] nmi_handle+0x31/0x4a
  [<c1029dc5>] ? ftrace_define_fields_irq_handler_entry+0x45/0x45
  [<c13950ed>] do_nmi+0xa0/0x2ff
  [<c1029dc5>] ? ftrace_define_fields_irq_handler_entry+0x45/0x45
  [<c13949e5>] nmi_stack_correct+0x28/0x2d
  [<c1029dc5>] ? ftrace_define_fields_irq_handler_entry+0x45/0x45
  [<c1003603>] ? do_softirq+0x4b/0x7f
  <IRQ>
  [<c102a06f>] irq_exit+0x35/0x5b
  [<c1018f56>] smp_apic_timer_interrupt+0x6c/0x7a
  [<c1394746>] apic_timer_interrupt+0x2a/0x30
 Code: 89 fe eb 08 31 c9 8b 45 0c ff 55 ec 83 c3 04 83 7d 10 00 74 0c 3b 5d 10 73 26 3b 5d e4 73 0c eb 1f 3b 5d f0 76 1a 3b 5d e8 73 15 <8b> 13 89 d0 89 55 e0 e8 ad 42 03 00 85 c0 8b 55 e0 75 a6 eb cc
 EIP: [<c1004237>] print_context_stack+0x6e/0x8d SS:ESP 0068:f5409ea0
 CR2: 000000000000000a
 ---[ end trace 62afee3481b00012 ]---
 Kernel panic - not syncing: Fatal exception in interrupt

V2:
* add comments to kernel_stack_pointer()
* always return a valid stack address by falling back to the address
  of regs

Reported-by: Yang Wei <wei.yang@windriver.com>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Link: http://lkml.kernel.org/r/20120912135059.GZ8285@erda.amd.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Jun Zhang <jun.zhang@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Blechd0se pushed a commit that referenced this pull request Dec 3, 2013
commit 412d32e upstream.

A rescue thread exiting TASK_INTERRUPTIBLE can lead to a task scheduling
off, never to be seen again.  In the case where this occurred, an exiting
thread hit reiserfs homebrew conditional resched while holding a mutex,
bringing the box to its knees.

PID: 18105  TASK: ffff8807fd412180  CPU: 5   COMMAND: "kdmflush"
 #0 [ffff8808157e7670] schedule at ffffffff8143f489
 #1 [ffff8808157e77b8] reiserfs_get_block at ffffffffa038ab2d [reiserfs]
 #2 [ffff8808157e79a8] __block_write_begin at ffffffff8117fb14
 #3 [ffff8808157e7a98] reiserfs_write_begin at ffffffffa0388695 [reiserfs]
 #4 [ffff8808157e7ad8] generic_perform_write at ffffffff810ee9e2
 #5 [ffff8808157e7b58] generic_file_buffered_write at ffffffff810eeb41
 #6 [ffff8808157e7ba8] __generic_file_aio_write at ffffffff810f1a3a
 #7 [ffff8808157e7c58] generic_file_aio_write at ffffffff810f1c88
 #8 [ffff8808157e7cc8] do_sync_write at ffffffff8114f850
 #9 [ffff8808157e7dd8] do_acct_process at ffffffff810a268f
    [exception RIP: kernel_thread_helper]
    RIP: ffffffff8144a5c0  RSP: ffff8808157e7f58  RFLAGS: 00000202
    RAX: 0000000000000000  RBX: 0000000000000000  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: ffffffff8107af60  RDI: ffff8803ee491d18
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018

Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Blechd0se pushed a commit that referenced this pull request Dec 3, 2013
commit 66efdc7 upstream.

snd_seq_timer_open() didn't catch the whole error path but let through
if the timer id is a slave.  This may lead to Oops by accessing the
uninitialized pointer.

 BUG: unable to handle kernel NULL pointer dereference at 00000000000002ae
 IP: [<ffffffff819b3477>] snd_seq_timer_open+0xe7/0x130
 PGD 785cd067 PUD 76964067 PMD 0
 Oops: 0002 [#4] SMP
 CPU 0
 Pid: 4288, comm: trinity-child7 Tainted: G      D W 3.9.0-rc1+ #100 Bochs Bochs
 RIP: 0010:[<ffffffff819b3477>]  [<ffffffff819b3477>] snd_seq_timer_open+0xe7/0x130
 RSP: 0018:ffff88006ece7d38  EFLAGS: 00010246
 RAX: 0000000000000286 RBX: ffff88007851b400 RCX: 0000000000000000
 RDX: 000000000000ffff RSI: ffff88006ece7d58 RDI: ffff88006ece7d38
 RBP: ffff88006ece7d98 R08: 000000000000000a R09: 000000000000fffe
 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
 R13: ffff8800792c5400 R14: 0000000000e8f000 R15: 0000000000000007
 FS:  00007f7aaa650700(0000) GS:ffff88007f800000(0000) GS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00000000000002ae CR3: 000000006efec000 CR4: 00000000000006f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 Process trinity-child7 (pid: 4288, threadinfo ffff88006ece6000, task ffff880076a8a290)
 Stack:
  0000000000000286 ffffffff828f2be0 ffff88006ece7d58 ffffffff810f354d
  65636e6575716573 2065756575712072 ffff8800792c0030 0000000000000000
  ffff88006ece7d98 ffff8800792c5400 ffff88007851b400 ffff8800792c5520
 Call Trace:
  [<ffffffff810f354d>] ? trace_hardirqs_on+0xd/0x10
  [<ffffffff819b17e9>] snd_seq_queue_timer_open+0x29/0x70
  [<ffffffff819ae01a>] snd_seq_ioctl_set_queue_timer+0xda/0x120
  [<ffffffff819acb9b>] snd_seq_do_ioctl+0x9b/0xd0
  [<ffffffff819acbe0>] snd_seq_ioctl+0x10/0x20
  [<ffffffff811b9542>] do_vfs_ioctl+0x522/0x570
  [<ffffffff8130a4b3>] ? file_has_perm+0x83/0xa0
  [<ffffffff810f354d>] ? trace_hardirqs_on+0xd/0x10
  [<ffffffff811b95ed>] sys_ioctl+0x5d/0xa0
  [<ffffffff813663fe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
  [<ffffffff81faed69>] system_call_fastpath+0x16/0x1b

Reported-and-tested-by: Tommi Rantala <tt.rantala@gmail.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Quarx2k pushed a commit that referenced this pull request Apr 5, 2014
The new_node_page() is processed as the following procedure.

1. A new node page is allocated.
2. Set PageUptodate with proper footer information.
3. Check if there is a free space for allocation
 4.a. If there is no space, f2fs returns with -ENOSPC.
 4.b. Otherwise, go next.

In the case of step #4.a, f2fs remains a wrong node page in the page cache
with the uptodate flag.

Also, even though a new node page is allocated successfully, an error can be
occurred afterwards due to allocation failure of the other data structures.
In such a case, remove_inode_page() would be triggered, so that we have to
clear uptodate flag in truncate_node() too.

So, we should remove the uptodate flag, if allocation is failed.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Quarx2k pushed a commit that referenced this pull request May 12, 2014
The following lockdep problem was reported by Or Gerlitz <ogerlitz@mellanox.com>:

    [ INFO: possible recursive locking detected ]
    3.3.0-32035-g1b2649e-dirty #4 Not tainted
    ---------------------------------------------
    kworker/5:1/418 is trying to acquire lock:
     (&id_priv->handler_mutex){+.+.+.}, at: [<ffffffffa0138a41>] rdma_destroy_i    d+0x33/0x1f0 [rdma_cm]

    but task is already holding lock:
     (&id_priv->handler_mutex){+.+.+.}, at: [<ffffffffa0135130>] cma_disable_ca    llback+0x24/0x45 [rdma_cm]

    other info that might help us debug this:
     Possible unsafe locking scenario:

           CPU0
           ----
      lock(&id_priv->handler_mutex);
      lock(&id_priv->handler_mutex);

     *** DEADLOCK ***

     May be due to missing lock nesting notation

    3 locks held by kworker/5:1/418:
     #0:  (ib_cm){.+.+.+}, at: [<ffffffff81042ac1>] process_one_work+0x210/0x4a    6
     #1:  ((&(&work->work)->work)){+.+.+.}, at: [<ffffffff81042ac1>] process_on    e_work+0x210/0x4a6
     #2:  (&id_priv->handler_mutex){+.+.+.}, at: [<ffffffffa0135130>] cma_disab    le_callback+0x24/0x45 [rdma_cm]

    stack backtrace:
    Pid: 418, comm: kworker/5:1 Not tainted 3.3.0-32035-g1b2649e-dirty #4
    Call Trace:
     [<ffffffff8102b0fb>] ? console_unlock+0x1f4/0x204
     [<ffffffff81068771>] __lock_acquire+0x16b5/0x174e
     [<ffffffff8106461f>] ? save_trace+0x3f/0xb3
     [<ffffffff810688fa>] lock_acquire+0xf0/0x116
     [<ffffffffa0138a41>] ? rdma_destroy_id+0x33/0x1f0 [rdma_cm]
     [<ffffffff81364351>] mutex_lock_nested+0x64/0x2ce
     [<ffffffffa0138a41>] ? rdma_destroy_id+0x33/0x1f0 [rdma_cm]
     [<ffffffff81065a78>] ? trace_hardirqs_on_caller+0x11e/0x155
     [<ffffffff81065abc>] ? trace_hardirqs_on+0xd/0xf
     [<ffffffffa0138a41>] rdma_destroy_id+0x33/0x1f0 [rdma_cm]
     [<ffffffffa0139c02>] cma_req_handler+0x418/0x644 [rdma_cm]
     [<ffffffffa012ee88>] cm_process_work+0x32/0x119 [ib_cm]
     [<ffffffffa0130299>] cm_req_handler+0x928/0x982 [ib_cm]
     [<ffffffffa01302f3>] ? cm_req_handler+0x982/0x982 [ib_cm]
     [<ffffffffa0130326>] cm_work_handler+0x33/0xfe5 [ib_cm]
     [<ffffffff81065a78>] ? trace_hardirqs_on_caller+0x11e/0x155
     [<ffffffffa01302f3>] ? cm_req_handler+0x982/0x982 [ib_cm]
     [<ffffffff81042b6e>] process_one_work+0x2bd/0x4a6
     [<ffffffff81042ac1>] ? process_one_work+0x210/0x4a6
     [<ffffffff813669f3>] ? _raw_spin_unlock_irq+0x2b/0x40
     [<ffffffff8104316e>] worker_thread+0x1d6/0x350
     [<ffffffff81042f98>] ? rescuer_thread+0x241/0x241
     [<ffffffff81046a32>] kthread+0x84/0x8c
     [<ffffffff8136e854>] kernel_thread_helper+0x4/0x10
     [<ffffffff81366d59>] ? retint_restore_args+0xe/0xe
     [<ffffffff810469ae>] ? __init_kthread_worker+0x56/0x56
     [<ffffffff8136e850>] ? gs_change+0xb/0xb

The actual locking is fine, since we're dealing with different locks,
but from the same lock class.  cma_disable_callback() acquires the
listening id mutex, whereas rdma_destroy_id() acquires the mutex for
the new connection id.  To fix this, delay the call to
rdma_destroy_id() until we've released the listening id mutex.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Quarx2k pushed a commit that referenced this pull request May 12, 2014
platform_device pdev can be NULL if CONFIG_MMC_OMAP_HS is not set.
Add check for NULL pointer. while at it move the duplicated functions
to omap4-common.c

Fixes the following boot crash seen with omap4sdp and omap4panda
when MMC is disabled.

Unable to handle kernel NULL pointer dereference at virtual address 0000008c
pgd = c0004000
[0000008c] *pgd=00000000
Internal error: Oops: 5 [#1] SMP ARM
Modules linked in:
CPU: 0    Not tainted  (3.4.0-rc1-05971-ga4dfa82 #4)
PC is at omap_4430sdp_init+0x184/0x410
LR is at device_add+0x1a0/0x664

Signed-off-by: Balaji T K <balajitk@ti.com>
Reported-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
Quarx2k pushed a commit that referenced this pull request May 12, 2014
xfs_sync_worker checks the MS_ACTIVE flag in s_flags to avoid doing
work during mount and unmount.  This flag can be cleared by unmount
after the xfs_sync_worker checks it but before the work is completed.
The has caused crashes in the completion handler for the dummy
transaction commited by xfs_sync_worker:

PID: 27544  TASK: ffff88013544e040  CPU: 3   COMMAND: "kworker/3:0"
 #0 [ffff88016fdff930] machine_kexec at ffffffff810244e9
 #1 [ffff88016fdff9a0] crash_kexec at ffffffff8108d053
 #2 [ffff88016fdffa70] oops_end at ffffffff813ad1b8
 #3 [ffff88016fdffaa0] no_context at ffffffff8102bd48
 #4 [ffff88016fdffaf0] __bad_area_nosemaphore at ffffffff8102c04d
 #5 [ffff88016fdffb40] bad_area_nosemaphore at ffffffff8102c12e
 #6 [ffff88016fdffb50] do_page_fault at ffffffff813afaee
 #7 [ffff88016fdffc60] page_fault at ffffffff813ac635
    [exception RIP: xlog_get_lowest_lsn+0x30]
    RIP: ffffffffa04a9910  RSP: ffff88016fdffd10  RFLAGS: 00010246
    RAX: ffffc90014e48000  RBX: ffff88014d879980  RCX: ffff88014d879980
    RDX: ffff8802214ee4c0  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: ffff88016fdffd10   R8: ffff88014d879a80   R9: 0000000000000000
    R10: 0000000000000001  R11: 0000000000000000  R12: ffff8802214ee400
    R13: ffff88014d879980  R14: 0000000000000000  R15: ffff88022fd96605
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffff88016fdffd18] xlog_state_do_callback at ffffffffa04aa186 [xfs]
 #9 [ffff88016fdffd98] xlog_state_done_syncing at ffffffffa04aa568 [xfs]

Protect xfs_sync_worker by using the s_umount semaphore at the read
level to provide exclusion with unmount while work is progressing.

Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Quarx2k pushed a commit that referenced this pull request May 12, 2014
While traversing the linked list of open file handles, if the identfied
file handle is invalid, a reopen is attempted and if it fails, we
resume traversing where we stopped and cifs can oops while accessing
invalid next element, for list might have changed.

So mark the invalid file handle and attempt reopen if no
valid file handle is found in rest of the list.
If reopen fails, move the invalid file handle to the end of the list
and start traversing the list again from the begining.
Repeat this four times before giving up and returning an error if
file reopen keeps failing.

Cc: <stable@vger.kernel.org>
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Quarx2k pushed a commit that referenced this pull request May 12, 2014
Pull CIFS updates from Steve French.

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6: (29 commits)
  cifs: fix oops while traversing open file list (try #4)
  cifs: Fix comment as d_alloc_root() is replaced by d_make_root()
  CIFS: Introduce SMB2 mounts as vers=2.1
  CIFS: Introduce SMB2 Kconfig option
  CIFS: Move add/set_credits and get_credits_field to ops structure
  CIFS: Move protocol specific demultiplex thread calls to ops struct
  CIFS: Move protocol specific part from cifs_readv_receive to ops struct
  CIFS: Move header_size/max_header_size to ops structure
  CIFS: Move protocol specific part from SendReceive2 to ops struct
  cifs: Include backup intent search flags during searches {try #2)
  CIFS: Separate protocol specific part from setlk
  CIFS: Separate protocol specific part from getlk
  CIFS: Separate protocol specific lock type handling
  CIFS: Convert lock type to 32 bit variable
  CIFS: Move locks to cifsFileInfo structure
  cifs: convert send_nt_cancel into a version specific op
  cifs: add a smb_version_operations/values structures and a smb_version enum
  cifs: remove the vers= and version= synonyms for ver=
  cifs: add warning about change in default cache semantics in 3.7
  cifs: display cache= option in /proc/mounts
  ...
Quarx2k pushed a commit that referenced this pull request May 12, 2014
…condition

When holding the mmap_sem for reading, pmd_offset_map_lock should only
run on a pmd_t that has been read atomically from the pmdp pointer,
otherwise we may read only half of it leading to this crash.

PID: 11679  TASK: f06e8000  CPU: 3   COMMAND: "do_race_2_panic"
 #0 [f06a9dd8] crash_kexec at c049b5ec
 #1 [f06a9e2c] oops_end at c083d1c2
 #2 [f06a9e40] no_context at c0433ded
 #3 [f06a9e64] bad_area_nosemaphore at c043401a
 #4 [f06a9e6c] __do_page_fault at c0434493
 #5 [f06a9eec] do_page_fault at c083eb45
 #6 [f06a9f04] error_code (via page_fault) at c083c5d5
    EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP:
    00000000
    DS:  007b     ESI: 9e201000 ES:  007b     EDI: 01fb4700 GS:  00e0
    CS:  0060     EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246
 #7 [f06a9f38] _spin_lock at c083bc14
 #8 [f06a9f44] sys_mincore at c0507b7d
 #9 [f06a9fb0] system_call at c083becd
                         start           len
    EAX: ffffffda  EBX: 9e200000  ECX: 00001000  EDX: 6228537f
    DS:  007b      ESI: 00000000  ES:  007b      EDI: 003d0f00
    SS:  007b      ESP: 62285354  EBP: 62285388  GS:  0033
    CS:  0073      EIP: 00291416  ERR: 000000da  EFLAGS: 00000286

This should be a longstanding bug affecting x86 32bit PAE without THP.
Only archs with 64bit large pmd_t and 32bit unsigned long should be
affected.

With THP enabled the barrier() in pmd_none_or_trans_huge_or_clear_bad()
would partly hide the bug when the pmd transition from none to stable,
by forcing a re-read of the *pmd in pmd_offset_map_lock, but when THP is
enabled a new set of problem arises by the fact could then transition
freely in any of the none, pmd_trans_huge or pmd_trans_stable states.
So making the barrier in pmd_none_or_trans_huge_or_clear_bad()
unconditional isn't good idea and it would be a flakey solution.

This should be fully fixed by introducing a pmd_read_atomic that reads
the pmd in order with THP disabled, or by reading the pmd atomically
with cmpxchg8b with THP enabled.

Luckily this new race condition only triggers in the places that must
already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix
is localized there but this bug is not related to THP.

NOTE: this can trigger on x86 32bit systems with PAE enabled with more
than 4G of ram, otherwise the high part of the pmd will never risk to be
truncated because it would be zero at all times, in turn so hiding the
SMP race.

This bug was discovered and fully debugged by Ulrich, quote:

----
[..]
pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and
eax.

    496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t
    *pmd)
    497 {
    498         /* depend on compiler for an atomic pmd read */
    499         pmd_t pmdval = *pmd;

                                // edi = pmd pointer
0xc0507a74 <sys_mincore+548>:   mov    0x8(%esp),%edi
...
                                // edx = PTE page table high address
0xc0507a84 <sys_mincore+564>:   mov    0x4(%edi),%edx
...
                                // eax = PTE page table low address
0xc0507a8e <sys_mincore+574>:   mov    (%edi),%eax

[..]

Please note that the PMD is not read atomically. These are two "mov"
instructions where the high order bits of the PMD entry are fetched
first. Hence, the above machine code is prone to the following race.

-  The PMD entry {high|low} is 0x0000000000000000.
   The "mov" at 0xc0507a84 loads 0x00000000 into edx.

-  A page fault (on another CPU) sneaks in between the two "mov"
   instructions and instantiates the PMD.

-  The PMD entry {high|low} is now 0x00000003fda38067.
   The "mov" at 0xc0507a8e loads 0xfda38067 into eax.
----

Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Quarx2k pushed a commit that referenced this pull request May 12, 2014
Remove spinlock as atomic_t can be used instead. Note we use only 16
lower bits, upper bits are changed but we impilcilty cast to u16.

This fix possible deadlock on IBSS mode reproted by lockdep:

=================================
[ INFO: inconsistent lock state ]
3.4.0-wl+ #4 Not tainted
---------------------------------
inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
kworker/u:2/30374 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (&(&intf->seqlock)->rlock){+.?...}, at: [<f9979a20>] rt2x00queue_create_tx_descriptor+0x380/0x490 [rt2x00lib]
{IN-SOFTIRQ-W} state was registered at:
  [<c04978ab>] __lock_acquire+0x47b/0x1050
  [<c0498504>] lock_acquire+0x84/0xf0
  [<c0835733>] _raw_spin_lock+0x33/0x40
  [<f9979a20>] rt2x00queue_create_tx_descriptor+0x380/0x490 [rt2x00lib]
  [<f9979f2a>] rt2x00queue_write_tx_frame+0x1a/0x300 [rt2x00lib]
  [<f997834f>] rt2x00mac_tx+0x7f/0x380 [rt2x00lib]
  [<f98fe363>] __ieee80211_tx+0x1b3/0x300 [mac80211]
  [<f98ffdf5>] ieee80211_tx+0x105/0x130 [mac80211]
  [<f99000dd>] ieee80211_xmit+0xad/0x100 [mac80211]
  [<f9900519>] ieee80211_subif_start_xmit+0x2d9/0x930 [mac80211]
  [<c0782e87>] dev_hard_start_xmit+0x307/0x660
  [<c079bb71>] sch_direct_xmit+0xa1/0x1e0
  [<c0784bb3>] dev_queue_xmit+0x183/0x730
  [<c078c27a>] neigh_resolve_output+0xfa/0x1e0
  [<c07b436a>] ip_finish_output+0x24a/0x460
  [<c07b4897>] ip_output+0xb7/0x100
  [<c07b2d60>] ip_local_out+0x20/0x60
  [<c07e01ff>] igmpv3_sendpack+0x4f/0x60
  [<c07e108f>] igmp_ifc_timer_expire+0x29f/0x330
  [<c04520fc>] run_timer_softirq+0x15c/0x2f0
  [<c0449e3e>] __do_softirq+0xae/0x1e0
irq event stamp: 18380437
hardirqs last  enabled at (18380437): [<c0526027>] __slab_alloc.clone.3+0x67/0x5f0
hardirqs last disabled at (18380436): [<c0525ff3>] __slab_alloc.clone.3+0x33/0x5f0
softirqs last  enabled at (18377616): [<c0449eb3>] __do_softirq+0x123/0x1e0
softirqs last disabled at (18377611): [<c041278d>] do_softirq+0x9d/0xe0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&(&intf->seqlock)->rlock);
  <Interrupt>
    lock(&(&intf->seqlock)->rlock);

 *** DEADLOCK ***

4 locks held by kworker/u:2/30374:
 #0:  (wiphy_name(local->hw.wiphy)){++++.+}, at: [<c045cf99>] process_one_work+0x109/0x3f0
 #1:  ((&sdata->work)){+.+.+.}, at: [<c045cf99>] process_one_work+0x109/0x3f0
 #2:  (&ifibss->mtx){+.+.+.}, at: [<f98f005b>] ieee80211_ibss_work+0x1b/0x470 [mac80211]
 #3:  (&intf->beacon_skb_mutex){+.+...}, at: [<f997a644>] rt2x00queue_update_beacon+0x24/0x50 [rt2x00lib]

stack backtrace:
Pid: 30374, comm: kworker/u:2 Not tainted 3.4.0-wl+ #4
Call Trace:
 [<c04962a6>] print_usage_bug+0x1f6/0x220
 [<c0496a12>] mark_lock+0x2c2/0x300
 [<c0495ff0>] ? check_usage_forwards+0xc0/0xc0
 [<c04978ec>] __lock_acquire+0x4bc/0x1050
 [<c0527890>] ? __kmalloc_track_caller+0x1c0/0x1d0
 [<c0777fb6>] ? copy_skb_header+0x26/0x90
 [<c0498504>] lock_acquire+0x84/0xf0
 [<f9979a20>] ? rt2x00queue_create_tx_descriptor+0x380/0x490 [rt2x00lib]
 [<c0835733>] _raw_spin_lock+0x33/0x40
 [<f9979a20>] ? rt2x00queue_create_tx_descriptor+0x380/0x490 [rt2x00lib]
 [<f9979a20>] rt2x00queue_create_tx_descriptor+0x380/0x490 [rt2x00lib]
 [<f997a5cf>] rt2x00queue_update_beacon_locked+0x5f/0xb0 [rt2x00lib]
 [<f997a64d>] rt2x00queue_update_beacon+0x2d/0x50 [rt2x00lib]
 [<f9977e3a>] rt2x00mac_bss_info_changed+0x1ca/0x200 [rt2x00lib]
 [<f9977c70>] ? rt2x00mac_remove_interface+0x70/0x70 [rt2x00lib]
 [<f98e4dd0>] ieee80211_bss_info_change_notify+0xe0/0x1d0 [mac80211]
 [<f98ef7b8>] __ieee80211_sta_join_ibss+0x3b8/0x610 [mac80211]
 [<c0496ab4>] ? mark_held_locks+0x64/0xc0
 [<c0440012>] ? virt_efi_query_capsule_caps+0x12/0x50
 [<f98efb09>] ieee80211_sta_join_ibss+0xf9/0x140 [mac80211]
 [<f98f0456>] ieee80211_ibss_work+0x416/0x470 [mac80211]
 [<c0496d8b>] ? trace_hardirqs_on+0xb/0x10
 [<c077683b>] ? skb_dequeue+0x4b/0x70
 [<f98f207f>] ieee80211_iface_work+0x13f/0x230 [mac80211]
 [<c045cf99>] ? process_one_work+0x109/0x3f0
 [<c045d015>] process_one_work+0x185/0x3f0
 [<c045cf99>] ? process_one_work+0x109/0x3f0
 [<f98f1f40>] ? ieee80211_teardown_sdata+0xa0/0xa0 [mac80211]
 [<c045ed86>] worker_thread+0x116/0x270
 [<c045ec70>] ? manage_workers+0x1e0/0x1e0
 [<c0462f64>] kthread+0x84/0x90
 [<c0462ee0>] ? __init_kthread_worker+0x60/0x60
 [<c083d382>] kernel_thread_helper+0x6/0x10

Cc: stable@vger.kernel.org
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Acked-by: Helmut Schaa <helmut.schaa@googlemail.com>
Acked-by: Gertjan van Wingerde <gwingerde@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Quarx2k pushed a commit that referenced this pull request May 12, 2014
The warning below triggers on AMD MCM packages because physical package
IDs on the cores of a _physical_ socket are the same. I.e., this field
says which CPUs belong to the same physical package.

However, the same two CPUs belong to two different internal, i.e.
"logical" nodes in the same physical socket which is reflected in the
CPU-to-node map on x86 with NUMA.

Which makes this check wrong on the above topologies so circumvent it.

[    0.444413] Booting Node   0, Processors  #1 #2 #3 #4 #5 Ok.
[    0.461388] ------------[ cut here ]------------
[    0.465997] WARNING: at arch/x86/kernel/smpboot.c:310 topology_sane.clone.1+0x6e/0x81()
[    0.473960] Hardware name: Dinar
[    0.477170] sched: CPU #6's mc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
[    0.486860] Booting Node   1, Processors  #6
[    0.491104] Modules linked in:
[    0.494141] Pid: 0, comm: swapper/6 Not tainted 3.4.0+ #1
[    0.499510] Call Trace:
[    0.501946]  [<ffffffff8144bf92>] ? topology_sane.clone.1+0x6e/0x81
[    0.508185]  [<ffffffff8102f1fc>] warn_slowpath_common+0x85/0x9d
[    0.514163]  [<ffffffff8102f2b7>] warn_slowpath_fmt+0x46/0x48
[    0.519881]  [<ffffffff8144bf92>] topology_sane.clone.1+0x6e/0x81
[    0.525943]  [<ffffffff8144c234>] set_cpu_sibling_map+0x251/0x371
[    0.532004]  [<ffffffff8144c4ee>] start_secondary+0x19a/0x218
[    0.537729] ---[ end trace 4eaa2a86a8e2da22 ]---
[    0.628197]  #7 #8 #9 #10 #11 Ok.
[    0.807108] Booting Node   3, Processors  #12 #13 #14 #15 #16 #17 Ok.
[    0.897587] Booting Node   2, Processors  #18 #19 #20 #21 #22 #23 Ok.
[    0.917443] Brought up 24 CPUs

We ran a topology sanity check test we have here on it and
it all looks ok... hopefully :).

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120529135442.GE29157@aftab.osrc.amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Quarx2k pushed a commit that referenced this pull request May 29, 2014
Andrey reported the following report:

ERROR: AddressSanitizer: heap-buffer-overflow on address ffff8800359c99f3
ffff8800359c99f3 is located 0 bytes to the right of 243-byte region [ffff8800359c9900, ffff8800359c99f3)
Accessed by thread T13003:
  #0 ffffffff810dd2da (asan_report_error+0x32a/0x440)
  #1 ffffffff810dc6b0 (asan_check_region+0x30/0x40)
  #2 ffffffff810dd4d3 (__tsan_write1+0x13/0x20)
  #3 ffffffff811cd19e (ftrace_regex_release+0x1be/0x260)
  #4 ffffffff812a1065 (__fput+0x155/0x360)
  #5 ffffffff812a12de (____fput+0x1e/0x30)
  #6 ffffffff8111708d (task_work_run+0x10d/0x140)
  #7 ffffffff810ea043 (do_exit+0x433/0x11f0)
  #8 ffffffff810eaee4 (do_group_exit+0x84/0x130)
  #9 ffffffff810eafb1 (SyS_exit_group+0x21/0x30)
  #10 ffffffff81928782 (system_call_fastpath+0x16/0x1b)

Allocated by thread T5167:
  #0 ffffffff810dc778 (asan_slab_alloc+0x48/0xc0)
  #1 ffffffff8128337c (__kmalloc+0xbc/0x500)
  #2 ffffffff811d9d54 (trace_parser_get_init+0x34/0x90)
  #3 ffffffff811cd7b3 (ftrace_regex_open+0x83/0x2e0)
  #4 ffffffff811cda7d (ftrace_filter_open+0x2d/0x40)
  #5 ffffffff8129b4ff (do_dentry_open+0x32f/0x430)
  #6 ffffffff8129b668 (finish_open+0x68/0xa0)
  #7 ffffffff812b66ac (do_last+0xb8c/0x1710)
  #8 ffffffff812b7350 (path_openat+0x120/0xb50)
  #9 ffffffff812b8884 (do_filp_open+0x54/0xb0)
  #10 ffffffff8129d36c (do_sys_open+0x1ac/0x2c0)
  #11 ffffffff8129d4b7 (SyS_open+0x37/0x50)
  #12 ffffffff81928782 (system_call_fastpath+0x16/0x1b)

Shadow bytes around the buggy address:
  ffff8800359c9700: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  ffff8800359c9780: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
  ffff8800359c9800: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  ffff8800359c9880: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  ffff8800359c9900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>ffff8800359c9980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00[03]fb
  ffff8800359c9a00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  ffff8800359c9a80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  ffff8800359c9b00: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
  ffff8800359c9b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ffff8800359c9c00: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap redzone:          fa
  Heap kmalloc redzone:  fb
  Freed heap region:     fd
  Shadow gap:            fe

The out-of-bounds access happens on 'parser->buffer[parser->idx] = 0;'

Although the crash happened in ftrace_regex_open() the real bug
occurred in trace_get_user() where there's an incrementation to
parser->idx without a check against the size. The way it is triggered
is if userspace sends in 128 characters (EVENT_BUF_SIZE + 1), the loop
that reads the last character stores it and then breaks out because
there is no more characters. Then the last character is read to determine
what to do next, and the index is incremented without checking size.

Then the caller of trace_get_user() usually nulls out the last character
with a zero, but since the index is equal to the size, it writes a nul
character after the allocated space, which can corrupt memory.

Luckily, only root user has write access to this file.

Link: http://lkml.kernel.org/r/20131009222323.04fd1a0d@gandalf.local.home

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Quarx2k pushed a commit that referenced this pull request May 29, 2014
…ux/kernel/git/tip/tip

Pull x86 boot changes from Ingo Molnar:
 "Two changes that prettify and compactify the SMP bootup output from:

     smpboot: Booting Node   0, Processors  #1 #2 #3 OK
     smpboot: Booting Node   1, Processors  #4 #5 #6 #7 OK
     smpboot: Booting Node   2, Processors  #8 #9 #10 #11 OK
     smpboot: Booting Node   3, Processors  #12 #13 #14 #15 OK
     Brought up 16 CPUs

  to something like:

     x86: Booting SMP configuration:
     .... node  #0, CPUs:        #1  #2  #3
     .... node  #1, CPUs:    #4  #5  #6  #7
     .... node  #2, CPUs:    #8  #9 #10 #11
     .... node  #3, CPUs:   #12 #13 #14 #15
     x86: Booted up 4 nodes, 16 CPUs"

* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot: Further compress CPUs bootup message
  x86: Improve the printout of the SMP bootup CPU table
Quarx2k pushed a commit that referenced this pull request May 29, 2014
As part of normal operaions, the hrtimer subsystem frequently calls
into the timekeeping code, creating a locking order of
  hrtimer locks -> timekeeping locks

clock_was_set_delayed() was suppoed to allow us to avoid deadlocks
between the timekeeping the hrtimer subsystem, so that we could
notify the hrtimer subsytem the time had changed while holding
the timekeeping locks. This was done by scheduling delayed work
that would run later once we were out of the timekeeing code.

But unfortunately the lock chains are complex enoguh that in
scheduling delayed work, we end up eventually trying to grab
an hrtimer lock.

Sasha Levin noticed this in testing when the new seqlock lockdep
enablement triggered the following (somewhat abrieviated) message:

[  251.100221] ======================================================
[  251.100221] [ INFO: possible circular locking dependency detected ]
[  251.100221] 3.13.0-rc2-next-20131206-sasha-00005-g8be2375-dirty #4053 Not tainted
[  251.101967] -------------------------------------------------------
[  251.101967] kworker/10:1/4506 is trying to acquire lock:
[  251.101967]  (timekeeper_seq){----..}, at: [<ffffffff81160e96>] retrigger_next_event+0x56/0x70
[  251.101967]
[  251.101967] but task is already holding lock:
[  251.101967]  (hrtimer_bases.lock#11){-.-...}, at: [<ffffffff81160e7c>] retrigger_next_event+0x3c/0x70
[  251.101967]
[  251.101967] which lock already depends on the new lock.
[  251.101967]
[  251.101967]
[  251.101967] the existing dependency chain (in reverse order) is:
[  251.101967]
-> #5 (hrtimer_bases.lock#11){-.-...}:
[snipped]
-> #4 (&rt_b->rt_runtime_lock){-.-...}:
[snipped]
-> #3 (&rq->lock){-.-.-.}:
[snipped]
-> #2 (&p->pi_lock){-.-.-.}:
[snipped]
-> #1 (&(&pool->lock)->rlock){-.-...}:
[  251.101967]        [<ffffffff81194803>] validate_chain+0x6c3/0x7b0
[  251.101967]        [<ffffffff81194d9d>] __lock_acquire+0x4ad/0x580
[  251.101967]        [<ffffffff81194ff2>] lock_acquire+0x182/0x1d0
[  251.101967]        [<ffffffff84398500>] _raw_spin_lock+0x40/0x80
[  251.101967]        [<ffffffff81153e69>] __queue_work+0x1a9/0x3f0
[  251.101967]        [<ffffffff81154168>] queue_work_on+0x98/0x120
[  251.101967]        [<ffffffff81161351>] clock_was_set_delayed+0x21/0x30
[  251.101967]        [<ffffffff811c4bd1>] do_adjtimex+0x111/0x160
[  251.101967]        [<ffffffff811e2711>] compat_sys_adjtimex+0x41/0x70
[  251.101967]        [<ffffffff843a4b49>] ia32_sysret+0x0/0x5
[  251.101967]
-> #0 (timekeeper_seq){----..}:
[snipped]
[  251.101967] other info that might help us debug this:
[  251.101967]
[  251.101967] Chain exists of:
  timekeeper_seq --> &rt_b->rt_runtime_lock --> hrtimer_bases.lock#11

[  251.101967]  Possible unsafe locking scenario:
[  251.101967]
[  251.101967]        CPU0                    CPU1
[  251.101967]        ----                    ----
[  251.101967]   lock(hrtimer_bases.lock#11);
[  251.101967]                                lock(&rt_b->rt_runtime_lock);
[  251.101967]                                lock(hrtimer_bases.lock#11);
[  251.101967]   lock(timekeeper_seq);
[  251.101967]
[  251.101967]  *** DEADLOCK ***
[  251.101967]
[  251.101967] 3 locks held by kworker/10:1/4506:
[  251.101967]  #0:  (events){.+.+.+}, at: [<ffffffff81154960>] process_one_work+0x200/0x530
[  251.101967]  #1:  (hrtimer_work){+.+...}, at: [<ffffffff81154960>] process_one_work+0x200/0x530
[  251.101967]  #2:  (hrtimer_bases.lock#11){-.-...}, at: [<ffffffff81160e7c>] retrigger_next_event+0x3c/0x70
[  251.101967]
[  251.101967] stack backtrace:
[  251.101967] CPU: 10 PID: 4506 Comm: kworker/10:1 Not tainted 3.13.0-rc2-next-20131206-sasha-00005-g8be2375-dirty #4053
[  251.101967] Workqueue: events clock_was_set_work

So the best solution is to avoid calling clock_was_set_delayed() while
holding the timekeeping lock, and instead using a flag variable to
decide if we should call clock_was_set() once we've released the locks.

This works for the case here, where the do_adjtimex() was the deadlock
trigger point. Unfortuantely, in update_wall_time() we still hold
the jiffies lock, which would deadlock with the ipi triggered by
clock_was_set(), preventing us from calling it even after we drop the
timekeeping lock. So instead call clock_was_set_delayed() at that point.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: stable <stable@vger.kernel.org> #3.10+
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Quarx2k pushed a commit that referenced this pull request May 29, 2014
Hayes Wang says:

====================
support new chip

Remove the trailing "/* CRC */" for patch #3.

Change the return value type of rtl_ops_init() from int to boolean
for patch #4.

Replace VENDOR_ID_SAMSUNG with SAMSUNG_VENDOR_ID for patch #6.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Quarx2k pushed a commit that referenced this pull request May 29, 2014
Use helper functions to simplify _DSM related code in TPM driver.

This patch also help to get rid of following warning messages:
[  163.509575] ACPI Error: Incorrect return type [Buffer] requested [Package]
(20130517/nsxfeval-135)

But there is still an warning left.
[  181.637366] ACPI Warning: \_SB_.IIO0.LPC0.TPM_._DSM: Argument #4 type
mismatch - Found [Buffer], ACPI requires [Package] (20130517/nsarguments-95)

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Quarx2k pushed a commit that referenced this pull request May 29, 2014
When a task enters call_refreshresult with status 0 from call_refresh and
!rpcauth_uptodatecred(task) it enters call_refresh again with no rate-limiting
or max number of retries.

Instead of trying forever, make use of the retry path that other errors use.

This only seems to be possible when the crrefresh callback is gss_refresh_null,
which only happens when destroying the context.

To reproduce:

1) mount with sec=krb5 (or sec=sys with krb5 negotiated for non FSID specific
   operations).

2) reboot - the client will be stuck and will need to be hard rebooted

BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:2:46]
Modules linked in: rpcsec_gss_krb5 nfsv4 nfs fscache ppdev crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd serio_raw i2c_piix4 i2c_core e1000 parport_pc parport shpchp nfsd auth_rpcgss oid_registry exportfs nfs_acl lockd sunrpc autofs4 mptspi scsi_transport_spi mptscsih mptbase ata_generic floppy
irq event stamp: 195724
hardirqs last  enabled at (195723): [<ffffffff814a925c>] restore_args+0x0/0x30
hardirqs last disabled at (195724): [<ffffffff814b0a6a>] apic_timer_interrupt+0x6a/0x80
softirqs last  enabled at (195722): [<ffffffff8103f583>] __do_softirq+0x1df/0x276
softirqs last disabled at (195717): [<ffffffff8103f852>] irq_exit+0x53/0x9a
CPU: 0 PID: 46 Comm: kworker/0:2 Not tainted 3.13.0-rc3-branch-dros_testing+ #4
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
Workqueue: rpciod rpc_async_schedule [sunrpc]
task: ffff8800799c4260 ti: ffff880079002000 task.ti: ffff880079002000
RIP: 0010:[<ffffffffa0064fd4>]  [<ffffffffa0064fd4>] __rpc_execute+0x8a/0x362 [sunrpc]
RSP: 0018:ffff880079003d18  EFLAGS: 00000246
RAX: 0000000000000005 RBX: 0000000000000007 RCX: 0000000000000007
RDX: 0000000000000007 RSI: ffff88007aecbae8 RDI: ffff8800783d8900
RBP: ffff880079003d78 R08: ffff88006e30e9f8 R09: ffffffffa005a3d7
R10: ffff88006e30e7b0 R11: ffff8800783d8900 R12: ffffffffa006675e
R13: ffff880079003ce8 R14: ffff88006e30e7b0 R15: ffff8800783d8900
FS:  0000000000000000(0000) GS:ffff88007f200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3072333000 CR3: 0000000001a0b000 CR4: 00000000001407f0
Stack:
 ffff880079003d98 0000000000000246 0000000000000000 ffff88007a9a4830
 ffff880000000000 ffffffff81073f47 ffff88007f212b00 ffff8800799c4260
 ffff8800783d8988 ffff88007f212b00 ffffe8ffff604800 0000000000000000
Call Trace:
 [<ffffffff81073f47>] ? trace_hardirqs_on_caller+0x145/0x1a1
 [<ffffffffa00652d3>] rpc_async_schedule+0x27/0x32 [sunrpc]
 [<ffffffff81052974>] process_one_work+0x211/0x3a5
 [<ffffffff810528d5>] ? process_one_work+0x172/0x3a5
 [<ffffffff81052eeb>] worker_thread+0x134/0x202
 [<ffffffff81052db7>] ? rescuer_thread+0x280/0x280
 [<ffffffff81052db7>] ? rescuer_thread+0x280/0x280
 [<ffffffff810584a0>] kthread+0xc9/0xd1
 [<ffffffff810583d7>] ? __kthread_parkme+0x61/0x61
 [<ffffffff814afd6c>] ret_from_fork+0x7c/0xb0
 [<ffffffff810583d7>] ? __kthread_parkme+0x61/0x61
Code: e8 87 63 fd e0 c6 05 10 dd 01 00 01 48 8b 43 70 4c 8d 6b 70 45 31 e4 a8 02 0f 85 d5 02 00 00 4c 8b 7b 48 48 c7 43 48 00 00 00 00 <4c> 8b 4b 50 4d 85 ff 75 0c 4d 85 c9 4d 89 cf 0f 84 32 01 00 00

And the output of "rpcdebug -m rpc -s all":

RPC:    61 call_refresh (status 0)
RPC:    61 call_refresh (status 0)
RPC:    61 refreshing RPCSEC_GSS cred ffff88007a413cf0
RPC:    61 refreshing RPCSEC_GSS cred ffff88007a413cf0
RPC:    61 call_refreshresult (status 0)
RPC:    61 refreshing RPCSEC_GSS cred ffff88007a413cf0
RPC:    61 call_refreshresult (status 0)
RPC:    61 refreshing RPCSEC_GSS cred ffff88007a413cf0
RPC:    61 call_refresh (status 0)
RPC:    61 call_refreshresult (status 0)
RPC:    61 call_refresh (status 0)
RPC:    61 call_refresh (status 0)
RPC:    61 refreshing RPCSEC_GSS cred ffff88007a413cf0
RPC:    61 call_refreshresult (status 0)
RPC:    61 call_refresh (status 0)
RPC:    61 refreshing RPCSEC_GSS cred ffff88007a413cf0
RPC:    61 call_refresh (status 0)
RPC:    61 refreshing RPCSEC_GSS cred ffff88007a413cf0
RPC:    61 refreshing RPCSEC_GSS cred ffff88007a413cf0
RPC:    61 call_refreshresult (status 0)
RPC:    61 call_refresh (status 0)
RPC:    61 call_refresh (status 0)
RPC:    61 call_refresh (status 0)
RPC:    61 call_refresh (status 0)
RPC:    61 call_refreshresult (status 0)
RPC:    61 refreshing RPCSEC_GSS cred ffff88007a413cf0

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Cc: stable@vger.kernel.org # 2.6.37+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Quarx2k pushed a commit that referenced this pull request May 29, 2014
Jon Maloy says:

====================
tipc: link setup and failover improvements

This series consists of four unrelated commits with different purposes.

- Commit #1 is purely cosmetic and pedagogic, hopefully making the
  failover/tunneling logics slightly easier to understand.
- Commit #2 fixes a bug that has always been in the code, but was not
  discovered until very recently.
- Commit #3 fixes a non-fatal race issue in the neighbour discovery
  code.
- Commit #4 removes an unnecessary indirection step during link
  startup.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Quarx2k pushed a commit that referenced this pull request May 29, 2014
Avoid circular mutex lock by pushing the dev->lock to the .fini
callback on each extension.

As em28xx-dvb, em28xx-alsa and em28xx-rc have their own data
structures, and don't touch at the common structure during .fini,
only em28xx-v4l needs to be locked.

[   90.994317] ======================================================
[   90.994356] [ INFO: possible circular locking dependency detected ]
[   90.994395] 3.13.0-rc1+ #24 Not tainted
[   90.994427] -------------------------------------------------------
[   90.994458] khubd/54 is trying to acquire lock:
[   90.994490]  (&card->controls_rwsem){++++.+}, at: [<ffffffffa0177b08>] snd_ctl_dev_free+0x28/0x60 [snd]
[   90.994656]
[   90.994656] but task is already holding lock:
[   90.994688]  (&dev->lock){+.+.+.}, at: [<ffffffffa040db81>] em28xx_close_extension+0x31/0x90 [em28xx]
[   90.994843]
[   90.994843] which lock already depends on the new lock.
[   90.994843]
[   90.994874]
[   90.994874] the existing dependency chain (in reverse order) is:
[   90.994905]
-> #1 (&dev->lock){+.+.+.}:
[   90.995057]        [<ffffffff810b8fa3>] __lock_acquire+0xb43/0x1330
[   90.995121]        [<ffffffff810b9f82>] lock_acquire+0xa2/0x120
[   90.995182]        [<ffffffff816a5b6c>] mutex_lock_nested+0x5c/0x3c0
[   90.995245]        [<ffffffffa0422cca>] em28xx_vol_put_mute+0x1ba/0x1d0 [em28xx_alsa]
[   90.995309]        [<ffffffffa017813d>] snd_ctl_elem_write+0xfd/0x140 [snd]
[   90.995376]        [<ffffffffa01791c2>] snd_ctl_ioctl+0xe2/0x810 [snd]
[   90.995442]        [<ffffffff811db8b0>] do_vfs_ioctl+0x300/0x520
[   90.995504]        [<ffffffff811dbb51>] SyS_ioctl+0x81/0xa0
[   90.995568]        [<ffffffff816b1929>] system_call_fastpath+0x16/0x1b
[   90.995630]
-> #0 (&card->controls_rwsem){++++.+}:
[   90.995780]        [<ffffffff810b7a47>] check_prevs_add+0x947/0x950
[   90.995841]        [<ffffffff810b8fa3>] __lock_acquire+0xb43/0x1330
[   90.995901]        [<ffffffff810b9f82>] lock_acquire+0xa2/0x120
[   90.995962]        [<ffffffff816a762b>] down_write+0x3b/0xa0
[   90.996022]        [<ffffffffa0177b08>] snd_ctl_dev_free+0x28/0x60 [snd]
[   90.996088]        [<ffffffffa017a255>] snd_device_free+0x65/0x140 [snd]
[   90.996154]        [<ffffffffa017a751>] snd_device_free_all+0x61/0xa0 [snd]
[   90.996219]        [<ffffffffa0173af4>] snd_card_do_free+0x14/0x130 [snd]
[   90.996283]        [<ffffffffa0173f14>] snd_card_free+0x84/0x90 [snd]
[   90.996349]        [<ffffffffa0423397>] em28xx_audio_fini+0x97/0xb0 [em28xx_alsa]
[   90.996411]        [<ffffffffa040dba6>] em28xx_close_extension+0x56/0x90 [em28xx]
[   90.996475]        [<ffffffffa040f639>] em28xx_usb_disconnect+0x79/0x90 [em28xx]
[   90.996539]        [<ffffffff814a06e7>] usb_unbind_interface+0x67/0x1d0
[   90.996620]        [<ffffffff8142920f>] __device_release_driver+0x7f/0xf0
[   90.996682]        [<ffffffff814292a5>] device_release_driver+0x25/0x40
[   90.996742]        [<ffffffff81428b0c>] bus_remove_device+0x11c/0x1a0
[   90.996801]        [<ffffffff81425536>] device_del+0x136/0x1d0
[   90.996863]        [<ffffffff8149e0c0>] usb_disable_device+0xb0/0x290
[   90.996923]        [<ffffffff814930c5>] usb_disconnect+0xb5/0x1d0
[   90.996984]        [<ffffffff81495ab6>] hub_port_connect_change+0xd6/0xad0
[   90.997044]        [<ffffffff814967c3>] hub_events+0x313/0x9b0
[   90.997105]        [<ffffffff81496e95>] hub_thread+0x35/0x170
[   90.997165]        [<ffffffff8108ea2f>] kthread+0xff/0x120
[   90.997226]        [<ffffffff816b187c>] ret_from_fork+0x7c/0xb0
[   90.997287]
[   90.997287] other info that might help us debug this:
[   90.997287]
[   90.997318]  Possible unsafe locking scenario:
[   90.997318]
[   90.997348]        CPU0                    CPU1
[   90.997378]        ----                    ----
[   90.997408]   lock(&dev->lock);
[   90.997497]                                lock(&card->controls_rwsem);
[   90.997607]                                lock(&dev->lock);
[   90.997697]   lock(&card->controls_rwsem);
[   90.997786]
[   90.997786]  *** DEADLOCK ***
[   90.997786]
[   90.997817] 5 locks held by khubd/54:
[   90.997847]  #0:  (&__lockdep_no_validate__){......}, at: [<ffffffff81496564>] hub_events+0xb4/0x9b0
[   90.998025]  #1:  (&__lockdep_no_validate__){......}, at: [<ffffffff81493076>] usb_disconnect+0x66/0x1d0
[   90.998204]  #2:  (&__lockdep_no_validate__){......}, at: [<ffffffff8142929d>] device_release_driver+0x1d/0x40
[   90.998383]  #3:  (em28xx_devlist_mutex){+.+.+.}, at: [<ffffffffa040db77>] em28xx_close_extension+0x27/0x90 [em28xx]
[   90.998567]  #4:  (&dev->lock){+.+.+.}, at: [<ffffffffa040db81>] em28xx_close_extension+0x31/0x90 [em28xx]

Reviewed-by: Frank Schäfer <fschaefer.oss@googlemail.com>
Tested-by: Antti Palosaari <crope@iki.fi>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Quarx2k pushed a commit that referenced this pull request May 29, 2014
Fix broken inline assembly contraints for cmpxchg64 on 32bit.

Fixes this crash:

specification exception: 0006 [#1] SMP
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.13.0 #4
task: 005a16c8 ti: 00592000 task.ti: 00592000
Krnl PSW : 070ce000 8029abd6 (lockref_get+0x3e/0x9c)
...
Krnl Code: 8029abcc: a71a0001           ahi     %r1,1
           8029abd0: 1852               lr      %r5,%r2
          #8029abd2: bb40f064           cds     %r4,%r0,100(%r15)
          >8029abd6: 1943               cr      %r4,%r3
           8029abd8: 1815               lr      %r1,%r5
Call Trace:
([<0000000078e01870>] 0x78e01870)
 [<000000000021105a>] sysfs_mount+0xd2/0x1c8
 [<00000000001b551e>] mount_fs+0x3a/0x134
 [<00000000001ce768>] vfs_kern_mount+0x44/0x11c
 [<00000000001ce864>] kern_mount_data+0x24/0x3c
 [<00000000005cc4b8>] sysfs_init+0x74/0xd4
 [<00000000005cb5b4>] mnt_init+0xe0/0x1fc
 [<00000000005cb16a>] vfs_caches_init+0xb6/0x14c
 [<00000000005be794>] start_kernel+0x318/0x33c
 [<000000000010001c>] _stext+0x1c/0x80

Reported-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Quarx2k pushed a commit that referenced this pull request May 29, 2014
skb_kill_datagram() does not dequeue the skb when MSG_PEEK is unset.
This leaves a free'd skb on the queue, resulting a double-free later.

Without this, the following oops can occur:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff8154fcf7>] skb_dequeue+0x47/0x70
PGD 0
Oops: 0002 [#1] SMP
Modules linked in: af_rxrpc ...
CPU: 0 PID: 1191 Comm: listen Not tainted 3.12.0+ #4
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8801183536b0 ti: ffff880035c92000 task.ti: ffff880035c92000
RIP: 0010:[<ffffffff8154fcf7>] skb_dequeue+0x47/0x70
RSP: 0018:ffff880035c93db8  EFLAGS: 00010097
RAX: 0000000000000246 RBX: ffff8800d2754b00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000202 RDI: ffff8800d254c084
RBP: ffff880035c93dd0 R08: ffff880035c93cf0 R09: ffff8800d968f270
R10: 0000000000000000 R11: 0000000000000293 R12: ffff8800d254c070
R13: ffff8800d254c084 R14: ffff8800cd861240 R15: ffff880119b39720
FS:  00007f37a969d740(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 00000000d4413000 CR4: 00000000000006f0
Stack:
 ffff8800d254c000 ffff8800d254c070 ffff8800d254c2c0 ffff880035c93df8
 ffffffffa041a5b8 ffff8800cd844c80 ffffffffa04385a0 ffff8800cd844cb0
 ffff880035c93e18 ffffffff81546cef ffff8800d45fea00 0000000000000008
Call Trace:
 [<ffffffffa041a5b8>] rxrpc_release+0x128/0x2e0 [af_rxrpc]
 [<ffffffff81546cef>] sock_release+0x1f/0x80
 [<ffffffff81546d62>] sock_close+0x12/0x20
 [<ffffffff811aaba1>] __fput+0xe1/0x230
 [<ffffffff811aad3e>] ____fput+0xe/0x10
 [<ffffffff810862cc>] task_work_run+0xbc/0xe0
 [<ffffffff8106a3be>] do_exit+0x2be/0xa10
 [<ffffffff8116dc47>] ? do_munmap+0x297/0x3b0
 [<ffffffff8106ab8f>] do_group_exit+0x3f/0xa0
 [<ffffffff8106ac04>] SyS_exit_group+0x14/0x20
 [<ffffffff8166b069>] system_call_fastpath+0x16/0x1b


Signed-off-by: Tim Smith <tim@electronghost.co.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Quarx2k pushed a commit that referenced this pull request May 29, 2014
Given now we have 2 spinlock for management of delayed refs,
CONFIG_DEBUG_SPINLOCK=y helped me find this,

[ 4723.413809] BUG: spinlock wrong CPU on CPU#1, btrfs-transacti/2258
[ 4723.414882]  lock: 0xffff880048377670, .magic: dead4ead, .owner: btrfs-transacti/2258, .owner_cpu: 2
[ 4723.417146] CPU: 1 PID: 2258 Comm: btrfs-transacti Tainted: G        W  O 3.12.0+ #4
[ 4723.421321] Call Trace:
[ 4723.421872]  [<ffffffff81680fe7>] dump_stack+0x54/0x74
[ 4723.422753]  [<ffffffff81681093>] spin_dump+0x8c/0x91
[ 4723.424979]  [<ffffffff816810b9>] spin_bug+0x21/0x26
[ 4723.425846]  [<ffffffff81323956>] do_raw_spin_unlock+0x66/0x90
[ 4723.434424]  [<ffffffff81689bf7>] _raw_spin_unlock+0x27/0x40
[ 4723.438747]  [<ffffffffa015da9e>] btrfs_cleanup_one_transaction+0x35e/0x710 [btrfs]
[ 4723.443321]  [<ffffffffa015df54>] btrfs_cleanup_transaction+0x104/0x570 [btrfs]
[ 4723.444692]  [<ffffffff810c1b5d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
[ 4723.450336]  [<ffffffff810c1c2d>] ? trace_hardirqs_on+0xd/0x10
[ 4723.451332]  [<ffffffffa015e5ee>] transaction_kthread+0x22e/0x270 [btrfs]
[ 4723.452543]  [<ffffffffa015e3c0>] ? btrfs_cleanup_transaction+0x570/0x570 [btrfs]
[ 4723.457833]  [<ffffffff81079efa>] kthread+0xea/0xf0
[ 4723.458990]  [<ffffffff81079e10>] ? kthread_create_on_node+0x140/0x140
[ 4723.460133]  [<ffffffff81692aac>] ret_from_fork+0x7c/0xb0
[ 4723.460865]  [<ffffffff81079e10>] ? kthread_create_on_node+0x140/0x140
[ 4723.496521] ------------[ cut here ]------------

----------------------------------------------------------------------

The reason is that we get to call cond_resched_lock(&head_ref->lock) while
still holding @delayed_refs->lock.

So it's different with __btrfs_run_delayed_refs(), where we do drop-acquire
dance before and after actually processing delayed refs.

Here we don't drop the lock, others are not able to add new delayed refs to
head_ref, so cond_resched_lock(&head_ref->lock) is not necessary here.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Quarx2k pushed a commit that referenced this pull request May 29, 2014
Guenter Roeck has got the following call trace on a p2020 board:
  Kernel stack overflow in process eb3e5a00, r1=eb79df90
  CPU: 0 PID: 2838 Comm: ssh Not tainted 3.13.0-rc8-juniper-00146-g19eca00 #4
  task: eb3e5a00 ti: c0616000 task.ti: ef440000
  NIP: c003a420 LR: c003a410 CTR: c0017518
  REGS: eb79dee0 TRAP: 0901   Not tainted (3.13.0-rc8-juniper-00146-g19eca00)
  MSR: 00029000 <CE,EE,ME>  CR: 24008444  XER: 00000000
  GPR00: c003a410 eb79df90 eb3e5a00 00000000 eb05d900 00000001 65d87646 00000000
  GPR08: 00000000 020b8000 00000000 00000000 44008442
  NIP [c003a420] __do_softirq+0x94/0x1ec
  LR [c003a410] __do_softirq+0x84/0x1ec
  Call Trace:
  [eb79df90] [c003a410] __do_softirq+0x84/0x1ec (unreliable)
  [eb79dfe0] [c003a970] irq_exit+0xbc/0xc8
  [eb79dff0] [c000cc1c] call_do_irq+0x24/0x3c
  [ef441f20] [c00046a8] do_IRQ+0x8c/0xf8
  [ef441f40] [c000e7f4] ret_from_except+0x0/0x18
  --- Exception: 501 at 0xfcda524
      LR = 0x10024900
  Instruction dump:
  7c781b78 3b40000a 3a73b040 543c0024 3a800000 3b3913a0 7ef5bb78 48201bf9
  5463103a 7d3b182e 7e89b92e 7c008146 <3ba00000> 7e7e9b78 48000014 57fff87f
  Kernel panic - not syncing: kernel stack overflow
  CPU: 0 PID: 2838 Comm: ssh Not tainted 3.13.0-rc8-juniper-00146-g19eca00 #4
  Call Trace:

The reason is that we have used the wrong register to calculate the
ksp_limit in commit cbc9565 (powerpc: Remove ksp_limit on ppc64).
Just fix it.

As suggested by Benjamin Herrenschmidt, also add the C prototype of the
function in the comment in order to avoid such kind of errors in the
future.

Cc: stable@vger.kernel.org # 3.12
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Quarx2k pushed a commit that referenced this pull request May 29, 2014
Currently we're using GFP_KERNEL, however there are some path(s) where we
can hold some spinlocks, specifically bond->curr_slave_lock:

[    4.722916] BUG: sleeping function called from invalid context at mm/slub.c:965
[    4.724438] in_atomic(): 1, irqs_disabled(): 0, pid: 940, name: ifup-eth
[    4.726034] 5 locks held by ifup-eth/940:
...snip...
[    4.734646]  #4:  (&bond->curr_slave_lock){+...+.}, at: [<ffffffffa00badc6>] bond_enslave+0xda6/0xdd0 [bonding]
...snip...
[    4.759081]  [<ffffffffa00b6f11>] bond_change_active_slave+0x191/0x3b0 [bonding]
[    4.760917]  [<ffffffffa00b7227>] bond_select_active_slave+0xf7/0x1d0 [bonding]
[    4.762751]  [<ffffffffa00badce>] bond_enslave+0xdae/0xdd0 [bonding]
...snip...

As it's out of hot path and is a really rare event - change the gfp_t flags
to GFP_ATOMIC to avoid sleeping under spinlock.

v2: convert new notify calls to GFP_ATOMIC.

CC: Thomas Glanzmann <thomas@glanzmann.de>
CC: Ding Tianhong <dingtianhong@huawei.com>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quarx2k pushed a commit that referenced this pull request May 29, 2014
Zoltan Kiss says:

====================
xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy

A long known problem of the upstream netback implementation that on the TX
path (from guest to Dom0) it copies the whole packet from guest memory into
Dom0. That simply became a bottleneck with 10Gb NICs, and generally it's a
huge perfomance penalty. The classic kernel version of netback used grant
mapping, and to get notified when the page can be unmapped, it used page
destructors. Unfortunately that destructor is not an upstreamable solution.
Ian Campbell's skb fragment destructor patch series [1] tried to solve this
problem, however it seems to be very invasive on the network stack's code,
and therefore haven't progressed very well.
This patch series use SKBTX_DEV_ZEROCOPY flags to tell the stack it needs to
know when the skb is freed up. That is the way KVM solved the same problem,
and based on my initial tests it can do the same for us. Avoiding the extra
copy boosted up TX throughput from 6.8 Gbps to 7.9 (I used a slower AMD
Interlagos box, both Dom0 and guest on upstream kernel, on the same NUMA node,
running iperf 2.0.5, and the remote end was a bare metal box on the same 10Gb
switch)
Based on my investigations the packet get only copied if it is delivered to
Dom0 IP stack through deliver_skb, which is due to this [2] patch. This affects
DomU->Dom0 IP traffic and when Dom0 does routing/NAT for the guest. That's a bit
unfortunate, but luckily it doesn't cause a major regression for this usecase.
In the future we should try to eliminate that copy somehow.
There are a few spinoff tasks which will be addressed in separate patches:
- grant copy the header directly instead of map and memcpy. This should help
  us avoiding TLB flushing
- use something else than ballooned pages
- fix grant map to use page->index properly
I've tried to broke it down to smaller patches, with mixed results, so I
welcome suggestions on that part as well:
1: Use skb->cb to store pending_idx
2: Some refactoring
3: Change RX path for mapped SKB fragments (moved here to keep bisectability,
review it after #4)
4: Introduce TX grant mapping
5: Remove old TX grant copy definitons and fix indentations
6: Add stat counters for zerocopy
7: Handle guests with too many frags
8: Timeout packets in RX path
9: Aggregate TX unmap operations

v2: I've fixed some smaller things, see the individual patches. I've added a
few new stat counters, and handling the important use case when an older guest
sends lots of slots. Instead of delayed copy now we timeout packets on the RX
path, based on the assumption that otherwise packets should get stucked
anywhere else. Finally some unmap batching to avoid too much TLB flush

v3: Apart from fixing a few things mentioned in responses the important change
is the use the hypercall directly for grant [un]mapping, therefore we can
avoid m2p override.

v4: Now we are using a new grant mapping API to avoid m2p_override. The RX queue
timeout logic changed also.

v5: Only minor fixes based on Wei's comments

v6: Important bugfixes for xenvif_poll exit path and zerocopy callback, see
first 2 patches. Also rework of handling packets with too many slots, and
reorder the series a bit.

v7: Small fixes in comments/log messages/error paths, and merging the frag
overflow stats patch into its parent.

[1] http://lwn.net/Articles/491522/
[2] https://lkml.org/lkml/2012/7/20/363
====================

Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quarx2k pushed a commit that referenced this pull request May 29, 2014
[13654.480669] ======================================================
[13654.480905] [ INFO: possible circular locking dependency detected ]
[13654.481003] 3.12.0+ #4 Tainted: G        W  O
[13654.481060] -------------------------------------------------------
[13654.481060] btrfs-transacti/9347 is trying to acquire lock:
[13654.481060]  (&(&root->ordered_extent_lock)->rlock){+.+...}, at: [<ffffffffa02d30a1>] btrfs_cleanup_transaction+0x271/0x570 [btrfs]
[13654.481060] but task is already holding lock:
[13654.481060]  (&(&fs_info->ordered_root_lock)->rlock){+.+...}, at: [<ffffffffa02d3015>] btrfs_cleanup_transaction+0x1e5/0x570 [btrfs]
[13654.481060] which lock already depends on the new lock.

[13654.481060] the existing dependency chain (in reverse order) is:
[13654.481060] -> #1 (&(&fs_info->ordered_root_lock)->rlock){+.+...}:
[13654.481060]        [<ffffffff810c4103>] lock_acquire+0x93/0x130
[13654.481060]        [<ffffffff81689991>] _raw_spin_lock+0x41/0x50
[13654.481060]        [<ffffffffa02f011b>] __btrfs_add_ordered_extent+0x39b/0x450 [btrfs]
[13654.481060]        [<ffffffffa02f0202>] btrfs_add_ordered_extent+0x32/0x40 [btrfs]
[13654.481060]        [<ffffffffa02df6aa>] run_delalloc_nocow+0x78a/0x9d0 [btrfs]
[13654.481060]        [<ffffffffa02dfc0d>] run_delalloc_range+0x31d/0x390 [btrfs]
[13654.481060]        [<ffffffffa02f7c00>] __extent_writepage+0x310/0x780 [btrfs]
[13654.481060]        [<ffffffffa02f830a>] extent_write_cache_pages.isra.29.constprop.48+0x29a/0x410 [btrfs]
[13654.481060]        [<ffffffffa02f879d>] extent_writepages+0x4d/0x70 [btrfs]
[13654.481060]        [<ffffffffa02d9f68>] btrfs_writepages+0x28/0x30 [btrfs]
[13654.481060]        [<ffffffff8114be91>] do_writepages+0x21/0x50
[13654.481060]        [<ffffffff81140d49>] __filemap_fdatawrite_range+0x59/0x60
[13654.481060]        [<ffffffff81140e13>] filemap_fdatawrite_range+0x13/0x20
[13654.481060]        [<ffffffffa02f1db9>] btrfs_wait_ordered_range+0x49/0x140 [btrfs]
[13654.481060]        [<ffffffffa0318fe2>] __btrfs_write_out_cache+0x682/0x8b0 [btrfs]
[13654.481060]        [<ffffffffa031952d>] btrfs_write_out_cache+0x8d/0xe0 [btrfs]
[13654.481060]        [<ffffffffa02c7083>] btrfs_write_dirty_block_groups+0x593/0x680 [btrfs]
[13654.481060]        [<ffffffffa0345307>] commit_cowonly_roots+0x14b/0x20d [btrfs]
[13654.481060]        [<ffffffffa02d7c1a>] btrfs_commit_transaction+0x43a/0x9d0 [btrfs]
[13654.481060]        [<ffffffffa030061a>] btrfs_create_uuid_tree+0x5a/0x100 [btrfs]
[13654.481060]        [<ffffffffa02d5a8a>] open_ctree+0x21da/0x2210 [btrfs]
[13654.481060]        [<ffffffffa02ab6fe>] btrfs_mount+0x68e/0x870 [btrfs]
[13654.481060]        [<ffffffff811b2409>] mount_fs+0x39/0x1b0
[13654.481060]        [<ffffffff811cd653>] vfs_kern_mount+0x63/0xf0
[13654.481060]        [<ffffffff811cfcce>] do_mount+0x23e/0xa90
[13654.481060]        [<ffffffff811d05a3>] SyS_mount+0x83/0xc0
[13654.481060]        [<ffffffff81692b52>] system_call_fastpath+0x16/0x1b
[13654.481060] -> #0 (&(&root->ordered_extent_lock)->rlock){+.+...}:
[13654.481060]        [<ffffffff810c340a>] __lock_acquire+0x150a/0x1a70
[13654.481060]        [<ffffffff810c4103>] lock_acquire+0x93/0x130
[13654.481060]        [<ffffffff81689991>] _raw_spin_lock+0x41/0x50
[13654.481060]        [<ffffffffa02d30a1>] btrfs_cleanup_transaction+0x271/0x570 [btrfs]
[13654.481060]        [<ffffffffa02d35ce>] transaction_kthread+0x22e/0x270 [btrfs]
[13654.481060]        [<ffffffff81079efa>] kthread+0xea/0xf0
[13654.481060]        [<ffffffff81692aac>] ret_from_fork+0x7c/0xb0
[13654.481060] other info that might help us debug this:

[13654.481060]  Possible unsafe locking scenario:

[13654.481060]        CPU0                    CPU1
[13654.481060]        ----                    ----
[13654.481060]   lock(&(&fs_info->ordered_root_lock)->rlock);
[13654.481060]				 lock(&(&root->ordered_extent_lock)->rlock);
[13654.481060]				 lock(&(&fs_info->ordered_root_lock)->rlock);
[13654.481060]   lock(&(&root->ordered_extent_lock)->rlock);
[13654.481060]
 *** DEADLOCK ***
[...]

======================================================

btrfs_destroy_all_ordered_extents()
gets &fs_info->ordered_root_lock __BEFORE__ acquiring &root->ordered_extent_lock,
while btrfs_[add,remove]_ordered_extent()
acquires &fs_info->ordered_root_lock __AFTER__ getting &root->ordered_extent_lock.

This patch fixes the above problem.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Quarx2k pushed a commit that referenced this pull request May 29, 2014
vmxnet3's netpoll driver is incorrectly coded.  It directly calls
vmxnet3_do_poll, which is the driver internal napi poll routine.  As the netpoll
controller method doesn't block real napi polls in any way, there is a potential
for race conditions in which the netpoll controller method and the napi poll
method run concurrently.  The result is data corruption causing panics such as this
one recently observed:
PID: 1371   TASK: ffff88023762caa0  CPU: 1   COMMAND: "rs:main Q:Reg"
 #0 [ffff88023abd5780] machine_kexec at ffffffff81038f3b
 #1 [ffff88023abd57e0] crash_kexec at ffffffff810c5d92
 #2 [ffff88023abd58b0] oops_end at ffffffff8152b570
 #3 [ffff88023abd58e0] die at ffffffff81010e0b
 #4 [ffff88023abd5910] do_trap at ffffffff8152add4
 #5 [ffff88023abd5970] do_invalid_op at ffffffff8100cf95
 #6 [ffff88023abd5a10] invalid_op at ffffffff8100bf9b
    [exception RIP: vmxnet3_rq_rx_complete+1968]
    RIP: ffffffffa00f1e80  RSP: ffff88023abd5ac8  RFLAGS: 00010086
    RAX: 0000000000000000  RBX: ffff88023b5dcee0  RCX: 00000000000000c0
    RDX: 0000000000000000  RSI: 00000000000005f2  RDI: ffff88023b5dcee0
    RBP: ffff88023abd5b48   R8: 0000000000000000   R9: ffff88023a3b6048
    R10: 0000000000000000  R11: 0000000000000002  R12: ffff8802398d4cd8
    R13: ffff88023af35140  R14: ffff88023b60c890  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffff88023abd5b50] vmxnet3_do_poll at ffffffffa00f204a [vmxnet3]
 #8 [ffff88023abd5b80] vmxnet3_netpoll at ffffffffa00f209c [vmxnet3]
 #9 [ffff88023abd5ba0] netpoll_poll_dev at ffffffff81472bb7

The fix is to do as other drivers do, and have the poll controller call the top
half interrupt handler, which schedules a napi poll properly to recieve frames

Tested by myself, successfully.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Shreyas Bhatewara <sbhatewara@vmware.com>
CC: "VMware, Inc." <pv-drivers@vmware.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: stable@vger.kernel.org
Reviewed-by: Shreyas N Bhatewara <sbhatewara@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quarx2k pushed a commit that referenced this pull request May 29, 2014
…to next/dt

Merge "mvebu dt changes for v3.15 (incremental #4)" from Jason Cooper:

 - dove
    - add system controller node
    - drop pinctrl PMU reg property _before_ it hits mainline and becomes ABI

 - mvebu
    - XP/370
       - change default PCIe apertures
       - switch GP and DB boards internal registers to 0xf1000000
       - correct RAM size on Matrix board
    - 385
       - correct phy connection type for DB board
       - add RD board

* tag 'mvebu-dt-3.15-4' of git://git.infradead.org/linux-mvebu:
  ARM: dove: drop pinctrl PMU reg property
  ARM: mvebu: add Device Tree for the Armada 385 RD board
  ARM: mvebu: use the correct phy connection mode on Armada 385 DB
  ARM: mvebu: the Armada XP Matrix board has 4 GB
  ARM: mvebu: switch the Armada XP GP to use internal registers at 0xf1000000
  ARM: mvebu: switch the Armada XP DB to use internal registers at 0xf1000000
  ARM: mvebu: change the default PCIe apertures for Armada 370/XP
  ARM: dove: add system controller node

Conflicts:
	arch/arm/boot/dts/Makefile

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Quarx2k pushed a commit that referenced this pull request May 29, 2014
I'm transitioning maintainership of the xHCI driver to my colleague,
Mathias Nyman.  The xHCI driver is in good shape, and it's time for me
to move on to the next shiny thing. :)

There's a few known outstanding bugs that we have plans for how to fix:

1. Clear Halt issue that means some USB scanners fail after one scan
2. TD fragment issue that means USB ethernet scatter-gather doesn't work
3. xHCI command queue issues that cause the driver to die when a USB
   device doesn't respond to a Set Address control transfer when another
   command is outstanding.
4. USB port power off for Haswell-ULT is a complete disaster.

Mathias is putting the finishing touches on a fix for #3, which will
make it much easier to craft a solution for #1.  Dan William has an
ACKed RFC for #4 that may land in 3.16, after much testing.  I'm working
with Mathias to come up with an architectural solution for #2.

I don't foresee very many big features coming down the pipe for USB
(which is part of the reason it's a good time to change now).  SSIC is
mostly a hardware-level change (perhaps with some PHY drivers needed),
USB 3.1 is again mostly a hardware-level change with some software
engineering to communicate the speed increase to the device drivers, add
new device descriptor parsing to lsusb, but definitely nothing as big as
USB 3.0 was.

Signed-off-by: Sarah Sharp <sarah.a.sharp@linux.intel.com>
Signed-off-by: Mathias Nyman <mathias.nyman@intel.com>
Quarx2k pushed a commit that referenced this pull request May 29, 2014
There are two problematic situations.

A deadlock can happen when is_percpu is false because it can get
interrupted while holding the spinlock. Then it executes
ovs_flow_stats_update() in softirq context which tries to get
the same lock.

The second sitation is that when is_percpu is true, the code
correctly disables BH but only for the local CPU, so the
following can happen when locking the remote CPU without
disabling BH:

       CPU#0                            CPU#1
  ovs_flow_stats_get()
   stats_read()
 +->spin_lock remote CPU#1        ovs_flow_stats_get()
 |  <interrupted>                  stats_read()
 |  ...                       +-->  spin_lock remote CPU#0
 |                            |     <interrupted>
 |  ovs_flow_stats_update()   |     ...
 |   spin_lock local CPU#0 <--+     ovs_flow_stats_update()
 +---------------------------------- spin_lock local CPU#1

This patch disables BH for both cases fixing the deadlocks.
Acked-by: Jesse Gross <jesse@nicira.com>

=================================
[ INFO: inconsistent lock state ]
3.14.0-rc8-00007-g632b06a #1 Tainted: G          I
---------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
swapper/0/0 [HC0[0]:SC1[5]:HE1:SE0] takes:
(&(&cpu_stats->lock)->rlock){+.?...}, at: [<ffffffffa05dd8a1>] ovs_flow_stats_update+0x51/0xd0 [openvswitch]
{SOFTIRQ-ON-W} state was registered at:
[<ffffffff810f973f>] __lock_acquire+0x68f/0x1c40
[<ffffffff810fb4e2>] lock_acquire+0xa2/0x1d0
[<ffffffff817d8d9e>] _raw_spin_lock+0x3e/0x80
[<ffffffffa05dd9e4>] ovs_flow_stats_get+0xc4/0x1e0 [openvswitch]
[<ffffffffa05da855>] ovs_flow_cmd_fill_info+0x185/0x360 [openvswitch]
[<ffffffffa05daf05>] ovs_flow_cmd_build_info.constprop.27+0x55/0x90 [openvswitch]
[<ffffffffa05db41d>] ovs_flow_cmd_new_or_set+0x4dd/0x570 [openvswitch]
[<ffffffff816c245d>] genl_family_rcv_msg+0x1cd/0x3f0
[<ffffffff816c270e>] genl_rcv_msg+0x8e/0xd0
[<ffffffff816c0239>] netlink_rcv_skb+0xa9/0xc0
[<ffffffff816c0798>] genl_rcv+0x28/0x40
[<ffffffff816bf830>] netlink_unicast+0x100/0x1e0
[<ffffffff816bfc57>] netlink_sendmsg+0x347/0x770
[<ffffffff81668e9c>] sock_sendmsg+0x9c/0xe0
[<ffffffff816692d9>] ___sys_sendmsg+0x3a9/0x3c0
[<ffffffff8166a911>] __sys_sendmsg+0x51/0x90
[<ffffffff8166a962>] SyS_sendmsg+0x12/0x20
[<ffffffff817e3ce9>] system_call_fastpath+0x16/0x1b
irq event stamp: 1740726
hardirqs last  enabled at (1740726): [<ffffffff8175d5e0>] ip6_finish_output2+0x4f0/0x840
hardirqs last disabled at (1740725): [<ffffffff8175d59b>] ip6_finish_output2+0x4ab/0x840
softirqs last  enabled at (1740674): [<ffffffff8109be12>] _local_bh_enable+0x22/0x50
softirqs last disabled at (1740675): [<ffffffff8109db05>] irq_exit+0xc5/0xd0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&(&cpu_stats->lock)->rlock);
  <Interrupt>
    lock(&(&cpu_stats->lock)->rlock);

 *** DEADLOCK ***

5 locks held by swapper/0/0:
 #0:  (((&ifa->dad_timer))){+.-...}, at: [<ffffffff810a7155>] call_timer_fn+0x5/0x320
 #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff81788a55>] mld_sendpack+0x5/0x4a0
 #2:  (rcu_read_lock_bh){.+....}, at: [<ffffffff8175d149>] ip6_finish_output2+0x59/0x840
 #3:  (rcu_read_lock_bh){.+....}, at: [<ffffffff8168ba75>] __dev_queue_xmit+0x5/0x9b0
 #4:  (rcu_read_lock){.+.+..}, at: [<ffffffffa05e41b5>] internal_dev_xmit+0x5/0x110 [openvswitch]

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G          I  3.14.0-rc8-00007-g632b06a #1
Hardware name:                  /DX58SO, BIOS SOX5810J.86A.5599.2012.0529.2218 05/29/2012
 0000000000000000 0fcf20709903df0c ffff88042d603808 ffffffff817cfe3c
 ffffffff81c134c0 ffff88042d603858 ffffffff817cb6da 0000000000000005
 ffffffff00000001 ffff880400000000 0000000000000006 ffffffff81c134c0
Call Trace:
 <IRQ>  [<ffffffff817cfe3c>] dump_stack+0x4d/0x66
 [<ffffffff817cb6da>] print_usage_bug+0x1f4/0x205
 [<ffffffff810f7f10>] ? check_usage_backwards+0x180/0x180
 [<ffffffff810f8963>] mark_lock+0x223/0x2b0
 [<ffffffff810f96d3>] __lock_acquire+0x623/0x1c40
 [<ffffffff810f5707>] ? __lock_is_held+0x57/0x80
 [<ffffffffa05e26c6>] ? masked_flow_lookup+0x236/0x250 [openvswitch]
 [<ffffffff810fb4e2>] lock_acquire+0xa2/0x1d0
 [<ffffffffa05dd8a1>] ? ovs_flow_stats_update+0x51/0xd0 [openvswitch]
 [<ffffffff817d8d9e>] _raw_spin_lock+0x3e/0x80
 [<ffffffffa05dd8a1>] ? ovs_flow_stats_update+0x51/0xd0 [openvswitch]
 [<ffffffffa05dd8a1>] ovs_flow_stats_update+0x51/0xd0 [openvswitch]
 [<ffffffffa05dcc64>] ovs_dp_process_received_packet+0x84/0x120 [openvswitch]
 [<ffffffff810f93f7>] ? __lock_acquire+0x347/0x1c40
 [<ffffffffa05e3bea>] ovs_vport_receive+0x2a/0x30 [openvswitch]
 [<ffffffffa05e4218>] internal_dev_xmit+0x68/0x110 [openvswitch]
 [<ffffffffa05e41b5>] ? internal_dev_xmit+0x5/0x110 [openvswitch]
 [<ffffffff8168b4a6>] dev_hard_start_xmit+0x2e6/0x8b0
 [<ffffffff8168be87>] __dev_queue_xmit+0x417/0x9b0
 [<ffffffff8168ba75>] ? __dev_queue_xmit+0x5/0x9b0
 [<ffffffff8175d5e0>] ? ip6_finish_output2+0x4f0/0x840
 [<ffffffff8168c430>] dev_queue_xmit+0x10/0x20
 [<ffffffff8175d641>] ip6_finish_output2+0x551/0x840
 [<ffffffff8176128a>] ? ip6_finish_output+0x9a/0x220
 [<ffffffff8176128a>] ip6_finish_output+0x9a/0x220
 [<ffffffff8176145f>] ip6_output+0x4f/0x1f0
 [<ffffffff81788c29>] mld_sendpack+0x1d9/0x4a0
 [<ffffffff817895b8>] mld_send_initial_cr.part.32+0x88/0xa0
 [<ffffffff817691b0>] ? addrconf_dad_completed+0x220/0x220
 [<ffffffff8178e301>] ipv6_mc_dad_complete+0x31/0x50
 [<ffffffff817690d7>] addrconf_dad_completed+0x147/0x220
 [<ffffffff817691b0>] ? addrconf_dad_completed+0x220/0x220
 [<ffffffff8176934f>] addrconf_dad_timer+0x19f/0x1c0
 [<ffffffff810a71e9>] call_timer_fn+0x99/0x320
 [<ffffffff810a7155>] ? call_timer_fn+0x5/0x320
 [<ffffffff817691b0>] ? addrconf_dad_completed+0x220/0x220
 [<ffffffff810a76c4>] run_timer_softirq+0x254/0x3b0
 [<ffffffff8109d47d>] __do_softirq+0x12d/0x480

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quarx2k pushed a commit that referenced this pull request May 29, 2014
virt_to_page() is incredibly inefficient when virt-to-phys patching is
enabled.  This is because we end up with this calculation:

  page = &mem_map[asm virt_to_phys(addr) >> 12 - __pv_phys_offset >> 12]

in assembly.  The asm virt_to_phys() is equivalent this this operation:

  addr - PAGE_OFFSET + __pv_phys_offset

and we can see that because this is assembly, the compiler has no chance
to optimise some of that away.  This should reduce down to:

  page = &mem_map[(addr - PAGE_OFFSET) >> 12]

for the common cases.  Permit the compiler to make this optimisation by
giving it more of the information it needs - do this by providing a
virt_to_pfn() macro.

Another issue which makes this more complex is that __pv_phys_offset is
a 64-bit type on all platforms.  This is needlessly wasteful - if we
store the physical offset as a PFN, we can save a lot of work having
to deal with 64-bit values, which sometimes ends up producing incredibly
horrid code:

     a4c:       e3009000        movw    r9, #0
                        a4c: R_ARM_MOVW_ABS_NC  __pv_phys_offset
     a50:       e3409000        movt    r9, #0          ; r9 = &__pv_phys_offset
                        a50: R_ARM_MOVT_ABS     __pv_phys_offset
     a54:       e3002000        movw    r2, #0
                        a54: R_ARM_MOVW_ABS_NC  __pv_phys_offset
     a58:       e3402000        movt    r2, #0          ; r2 = &__pv_phys_offset
                        a58: R_ARM_MOVT_ABS     __pv_phys_offset
     a5c:       e5999004        ldr     r9, [r9, #4]    ; r9 = high word of __pv_phys_offset
     a60:       e3001000        movw    r1, #0
                        a60: R_ARM_MOVW_ABS_NC  mem_map
     a64:       e592c000        ldr     ip, [r2]        ; ip = low word of __pv_phys_offset

Reviewed-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Quarx2k pushed a commit that referenced this pull request May 29, 2014
With EXT4FS_DEBUG ext4_count_free_clusters() will call
ext4_read_block_bitmap() without s_group_info initialized, so we need to
initialize multi-block allocator before.

And dependencies that must be solved, to allow this:
- multi-block allocator needs in group descriptors
- need to install s_op before initializing multi-block allocator,
  because in ext4_mb_init_backend() new inode is created.
- initialize number of group desc blocks (s_gdb_count) otherwise
  number of clusters returned by ext4_free_clusters_after_init() is not correct.
  (see ext4_bg_num_gdb_nometa())

Here is the stack backtrace:

(gdb) bt
 #0  ext4_get_group_info (group=0, sb=0xffff880079a10000) at ext4.h:2430
 #1  ext4_validate_block_bitmap (sb=sb@entry=0xffff880079a10000,
     desc=desc@entry=0xffff880056510000, block_group=block_group@entry=0,
     bh=bh@entry=0xffff88007bf2b2d8) at balloc.c:358
 #2  0xffffffff81232202 in ext4_wait_block_bitmap (sb=sb@entry=0xffff880079a10000,
     block_group=block_group@entry=0,
     bh=bh@entry=0xffff88007bf2b2d8) at balloc.c:476
 #3  0xffffffff81232eaf in ext4_read_block_bitmap (sb=sb@entry=0xffff880079a10000,
     block_group=block_group@entry=0) at balloc.c:489
 #4  0xffffffff81232fc0 in ext4_count_free_clusters (sb=sb@entry=0xffff880079a10000) at balloc.c:665
 #5  0xffffffff81259ffa in ext4_check_descriptors (first_not_zeroed=<synthetic pointer>,
     sb=0xffff880079a10000) at super.c:2143
 #6  ext4_fill_super (sb=sb@entry=0xffff880079a10000, data=<optimized out>,
     data@entry=0x0 <irq_stack_union>, silent=silent@entry=0) at super.c:3851
     ...

Signed-off-by: Azat Khuzhin <a3at.mail@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Quarx2k pushed a commit that referenced this pull request May 29, 2014
There was a deadlock in monitor mode when we were setting the
channel if the channel was not 1.

======================================================
[ INFO: possible circular locking dependency detected ]
3.14.3 #4 Not tainted
-------------------------------------------------------
iw/3323 is trying to acquire lock:
 (&local->chanctx_mtx){+.+.+.}, at: [<ffffffffa062e2f2>] ieee80211_vif_release_channel+0x42/0xb0 [mac80211]

but task is already holding lock:
 (&local->iflist_mtx){+.+...}, at: [<ffffffffa0609e0a>] ieee80211_set_monitor_channel+0x5a/0x1b0 [mac80211]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&local->iflist_mtx){+.+...}:
       [<ffffffff810d95bb>] __lock_acquire+0xb3b/0x13b0
       [<ffffffff810d9ee0>] lock_acquire+0xb0/0x1f0
       [<ffffffff817eb9c8>] mutex_lock_nested+0x78/0x4f0
       [<ffffffffa06225cf>] ieee80211_iterate_active_interfaces+0x2f/0x60 [mac80211]
       [<ffffffffa0518189>] iwl_mvm_recalc_multicast+0x49/0xa0 [iwlmvm]
       [<ffffffffa051822e>] iwl_mvm_configure_filter+0x4e/0x70 [iwlmvm]
       [<ffffffffa05e6d43>] ieee80211_configure_filter+0x153/0x5f0 [mac80211]
       [<ffffffffa05e71f5>] ieee80211_reconfig_filter+0x15/0x20 [mac80211]
       [snip]

-> #1 (&mvm->mutex){+.+.+.}:
       [<ffffffff810d95bb>] __lock_acquire+0xb3b/0x13b0
       [<ffffffff810d9ee0>] lock_acquire+0xb0/0x1f0
       [<ffffffff817eb9c8>] mutex_lock_nested+0x78/0x4f0
       [<ffffffffa0517246>] iwl_mvm_add_chanctx+0x56/0xe0 [iwlmvm]
       [<ffffffffa062ca1e>] ieee80211_new_chanctx+0x13e/0x410 [mac80211]
       [<ffffffffa062d953>] ieee80211_vif_use_channel+0x1c3/0x5a0 [mac80211]
       [<ffffffffa06035ab>] ieee80211_add_virtual_monitor+0x1ab/0x6b0 [mac80211]
       [<ffffffffa06052ea>] ieee80211_do_open+0xe6a/0x15a0 [mac80211]
       [<ffffffffa0605a79>] ieee80211_open+0x59/0x60 [mac80211]
       [snip]

-> #0 (&local->chanctx_mtx){+.+.+.}:
       [<ffffffff810d6cb7>] check_prevs_add+0x977/0x980
       [<ffffffff810d95bb>] __lock_acquire+0xb3b/0x13b0
       [<ffffffff810d9ee0>] lock_acquire+0xb0/0x1f0
       [<ffffffff817eb9c8>] mutex_lock_nested+0x78/0x4f0
       [<ffffffffa062e2f2>] ieee80211_vif_release_channel+0x42/0xb0 [mac80211]
       [<ffffffffa0609ec3>] ieee80211_set_monitor_channel+0x113/0x1b0 [mac80211]
       [<ffffffffa058fb37>] cfg80211_set_monitor_channel+0x77/0x2b0 [cfg80211]
       [<ffffffffa056e0b2>] __nl80211_set_channel+0x122/0x140 [cfg80211]
       [<ffffffffa0581374>] nl80211_set_wiphy+0x284/0xaf0 [cfg80211]
       [snip]

other info that might help us debug this:

Chain exists of:
  &local->chanctx_mtx --> &mvm->mutex --> &local->iflist_mtx

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&local->iflist_mtx);
                               lock(&mvm->mutex);
                               lock(&local->iflist_mtx);
  lock(&local->chanctx_mtx);

 *** DEADLOCK ***

This deadlock actually occurs:
INFO: task iw:3323 blocked for more than 120 seconds.
      Not tainted 3.14.3 #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
iw              D ffff8800c8afcd80  4192  3323   3322 0x00000000
 ffff880078fdb7e0 0000000000000046 ffff8800c8afcd80 ffff880078fdbfd8
 00000000001d5540 00000000001d5540 ffff8801141b0000 ffff8800c8afcd80
 ffff880078ff9e38 ffff880078ff9e38 ffff880078ff9e40 0000000000000246
Call Trace:
 [<ffffffff817ea841>] schedule_preempt_disabled+0x31/0x80
 [<ffffffff817ebaed>] mutex_lock_nested+0x19d/0x4f0
 [<ffffffffa06225cf>] ? ieee80211_iterate_active_interfaces+0x2f/0x60 [mac80211]
 [<ffffffffa06225cf>] ? ieee80211_iterate_active_interfaces+0x2f/0x60 [mac80211]
 [<ffffffffa052a680>] ? iwl_mvm_power_mac_update_mode+0xc0/0xc0 [iwlmvm]
 [<ffffffffa06225cf>] ieee80211_iterate_active_interfaces+0x2f/0x60 [mac80211]
 [<ffffffffa0529357>] _iwl_mvm_power_update_binding+0x27/0x80 [iwlmvm]
 [<ffffffffa0516eb1>] iwl_mvm_unassign_vif_chanctx+0x81/0xc0 [iwlmvm]
 [<ffffffffa062d3ff>] __ieee80211_vif_release_channel+0xdf/0x470 [mac80211]
 [<ffffffffa062e2fa>] ieee80211_vif_release_channel+0x4a/0xb0 [mac80211]
 [<ffffffffa0609ec3>] ieee80211_set_monitor_channel+0x113/0x1b0 [mac80211]
 [<ffffffffa058fb37>] cfg80211_set_monitor_channel+0x77/0x2b0 [cfg80211]
 [<ffffffffa056e0b2>] __nl80211_set_channel+0x122/0x140 [cfg80211]
 [<ffffffffa0581374>] nl80211_set_wiphy+0x284/0xaf0 [cfg80211]

This fixes https://bugzilla.kernel.org/show_bug.cgi?id=75541

Cc: <stable@vger.kernel.org> [3.13+]
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Quarx2k pushed a commit that referenced this pull request May 29, 2014
After 96d365e ("cgroup: make css_set_lock a rwsem and rename it
to css_set_rwsem"), css task iterators requires sleepable context as
it may block on css_set_rwsem.  I missed that cgroup_freezer was
iterating tasks under IRQ-safe spinlock freezer->lock.  This leads to
errors like the following on freezer state reads and transitions.

  BUG: sleeping function called from invalid context at /work
 /os/work/kernel/locking/rwsem.c:20
  in_atomic(): 0, irqs_disabled(): 0, pid: 462, name: bash
  5 locks held by bash/462:
   #0:  (sb_writers#7){.+.+.+}, at: [<ffffffff811f0843>] vfs_write+0x1a3/0x1c0
   #1:  (&of->mutex){+.+.+.}, at: [<ffffffff8126d78b>] kernfs_fop_write+0xbb/0x170
   #2:  (s_active#70){.+.+.+}, at: [<ffffffff8126d793>] kernfs_fop_write+0xc3/0x170
   #3:  (freezer_mutex){+.+...}, at: [<ffffffff81135981>] freezer_write+0x61/0x1e0
   #4:  (rcu_read_lock){......}, at: [<ffffffff81135973>] freezer_write+0x53/0x1e0
  Preemption disabled at:[<ffffffff81104404>] console_unlock+0x1e4/0x460

  CPU: 3 PID: 462 Comm: bash Not tainted 3.15.0-rc1-work+ #10
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
   ffff88000916a6d0 ffff88000e0a3da0 ffffffff81cf8c96 0000000000000000
   ffff88000e0a3dc8 ffffffff810cf4f2 ffffffff82388040 ffff880013aaf740
   0000000000000002 ffff88000e0a3de8 ffffffff81d05974 0000000000000246
  Call Trace:
   [<ffffffff81cf8c96>] dump_stack+0x4e/0x7a
   [<ffffffff810cf4f2>] __might_sleep+0x162/0x260
   [<ffffffff81d05974>] down_read+0x24/0x60
   [<ffffffff81133e87>] css_task_iter_start+0x27/0x70
   [<ffffffff8113584d>] freezer_apply_state+0x5d/0x130
   [<ffffffff81135a16>] freezer_write+0xf6/0x1e0
   [<ffffffff8112eb88>] cgroup_file_write+0xd8/0x230
   [<ffffffff8126d7b7>] kernfs_fop_write+0xe7/0x170
   [<ffffffff811f0756>] vfs_write+0xb6/0x1c0
   [<ffffffff811f121d>] SyS_write+0x4d/0xc0
   [<ffffffff81d08292>] system_call_fastpath+0x16/0x1b

freezer->lock used to be used in hot paths but that time is long gone
and there's no reason for the lock to be IRQ-safe spinlock or even
per-cgroup.  In fact, given the fact that a cgroup may contain large
number of tasks, it's not a good idea to iterate over them while
holding IRQ-safe spinlock.

Let's simplify locking by replacing per-cgroup freezer->lock with
global freezer_mutex.  This also makes the comments explaining the
intricacies of policy inheritance and the locking around it as the
states are protected by a common mutex.

The conversion is mostly straight-forward.  The followings are worth
mentioning.

* freezer_css_online() no longer needs double locking.

* freezer_attach() now performs propagation simply while holding
  freezer_mutex.  update_if_frozen() race no longer exists and the
  comment is removed.

* freezer_fork() now tests whether the task is in root cgroup using
  the new task_css_is_root() without doing rcu_read_lock/unlock().  If
  not, it grabs freezer_mutex and performs the operation.

* freezer_read() and freezer_change_state() grab freezer_mutex across
  the whole operation and pin the css while iterating so that each
  descendant processing happens in sleepable context.

Fixes: 96d365e ("cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem")
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Quarx2k pushed a commit that referenced this pull request Jun 25, 2014
Currently, the 32-bit and 64-bit atomic operations on ARM do not
include memory constraints in the inline assembly blocks. In the
case of barrier-less operations [for example, atomic_add], this
means that the compiler may constant fold values which have actually
been modified by a call to an atomic operation.

This issue can be observed in the atomic64_test routine in
<kernel root>/lib/atomic64_test.c:

00000000 <test_atomic64>:
   0:	e1a0c00d 	mov	ip, sp
   4:	e92dd830 	push	{r4, r5, fp, ip, lr, pc}
   8:	e24cb004 	sub	fp, ip, #4
   c:	e24dd008 	sub	sp, sp, #8
  10:	e24b3014 	sub	r3, fp, #20
  14:	e30d000d 	movw	r0, #53261	; 0xd00d
  18:	e3011337 	movw	r1, #4919	; 0x1337
  1c:	e34c0001 	movt	r0, #49153	; 0xc001
  20:	e34a1aa3 	movt	r1, #43683	; 0xaaa3
  24:	e16300f8 	strd	r0, [r3, #-8]!
  28:	e30c0afe 	movw	r0, #51966	; 0xcafe
  2c:	e30b1eef 	movw	r1, #48879	; 0xbeef
  30:	e34d0eaf 	movt	r0, #57007	; 0xdeaf
  34:	e34d1ead 	movt	r1, #57005	; 0xdead
  38:	e1b34f9f 	ldrexd	r4, [r3]
  3c:	e1a34f90 	strexd	r4, r0, [r3]
  40:	e3340000 	teq	r4, #0
  44:	1afffffb 	bne	38 <test_atomic64+0x38>
  48:	e59f0004 	ldr	r0, [pc, #4]	; 54 <test_atomic64+0x54>
  4c:	e3a0101e 	mov	r1, #30
  50:	ebfffffe 	bl	0 <__bug>
  54:	00000000 	.word	0x00000000

The atomic64_set (0x38-0x44) writes to the atomic64_t, but the
compiler doesn't see this, assumes the test condition is always
false and generates an unconditional branch to __bug. The rest of the
test is optimised away.

This patch adds suitable memory constraints to the atomic operations on ARM
to ensure that the compiler is informed of the correct data hazards. We have
to use the "Qo" constraints to avoid hitting the GCC anomaly described at
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44492 , where the compiler
makes assumptions about the writeback in the addressing mode used by the
inline assembly. These constraints forbid the use of auto{inc,dec} addressing
modes, so it doesn't matter if we don't use the operand exactly once.

Cc: stable@kernel.org
Reviewed-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Quarx2k pushed a commit that referenced this pull request Nov 8, 2014
…optimizations

Recent GCC versions (e.g. GCC-4.7.2) perform optimizations based on
assumptions about the implementation of memset and similar functions.
The current ARM optimized memset code does not return the value of
its first argument, as is usually expected from standard implementations.

For instance in the following function:

void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
{
	memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
	waiter->magic = waiter;
	INIT_LIST_HEAD(&waiter->list);
}

compiled as:

800554d0 <debug_mutex_lock_common>:
800554d0:       e92d4008        push    {r3, lr}
800554d4:       e1a00001        mov     r0, r1
800554d8:       e3a02010        mov     r2, #16 ; 0x10
800554dc:       e3a01011        mov     r1, #17 ; 0x11
800554e0:       eb04426e        bl      80165ea0 <memset>
800554e4:       e1a03000        mov     r3, r0
800554e8:       e583000c        str     r0, [r3, #12]
800554ec:       e5830000        str     r0, [r3]
800554f0:       e5830004        str     r0, [r3, #4]
800554f4:       e8bd8008        pop     {r3, pc}

GCC assumes memset returns the value of pointer 'waiter' in register r0; causing
register/memory corruptions.

This patch fixes the return value of the assembly version of memset.
It adds a 'mov' instruction and merges an additional load+store into
existing load/store instructions.
For ease of review, here is a breakdown of the patch into 4 simple steps:

Step 1
======
Perform the following substitutions:
ip -> r8, then
r0 -> ip,
and insert 'mov ip, r0' as the first statement of the function.
At this point, we have a memset() implementation returning the proper result,
but corrupting r8 on some paths (the ones that were using ip).

Step 2
======
Make sure r8 is saved and restored when (! CALGN(1)+0) == 1:

save r8:
-       str     lr, [sp, #-4]!
+       stmfd   sp!, {r8, lr}

and restore r8 on both exit paths:
-       ldmeqfd sp!, {pc}               @ Now <64 bytes to go.
+       ldmeqfd sp!, {r8, pc}           @ Now <64 bytes to go.
(...)
        tst     r2, #16
        stmneia ip!, {r1, r3, r8, lr}
-       ldr     lr, [sp], #4
+       ldmfd   sp!, {r8, lr}

Step 3
======
Make sure r8 is saved and restored when (! CALGN(1)+0) == 0:

save r8:
-       stmfd   sp!, {r4-r7, lr}
+       stmfd   sp!, {r4-r8, lr}

and restore r8 on both exit paths:
        bgt     3b
-       ldmeqfd sp!, {r4-r7, pc}
+       ldmeqfd sp!, {r4-r8, pc}
(...)
        tst     r2, #16
        stmneia ip!, {r4-r7}
-       ldmfd   sp!, {r4-r7, lr}
+       ldmfd   sp!, {r4-r8, lr}

Step 4
======
Rewrite register list "r4-r7, r8" as "r4-r8".

Signed-off-by: Ivan Djelic <ivan.djelic@parrot.com>
Reviewed-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Dirk Behme <dirk.behme@gmail.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Quarx2k pushed a commit that referenced this pull request Nov 12, 2014
This patch series provides the ability for cgroup subsystems to be
compiled as modules both within and outside the kernel tree.  This is
mainly useful for classifiers and subsystems that hook into components
that are already modules.  cls_cgroup and blkio-cgroup serve as the
example use cases for this feature.

It provides an interface cgroup_load_subsys() and cgroup_unload_subsys()
which modular subsystems can use to register and depart during runtime.
The net_cls classifier subsystem serves as the example for a subsystem
which can be converted into a module using these changes.

Patch #1 sets up the subsys[] array so its contents can be dynamic as
modules appear and (eventually) disappear.  Iterations over the array are
modified to handle when subsystems are absent, and the dynamic section of
the array is protected by cgroup_mutex.

Patch #2 implements an interface for modules to load subsystems, called
cgroup_load_subsys, similar to cgroup_init_subsys, and adds a module
pointer in struct cgroup_subsys.

Patch #3 adds a mechanism for unloading modular subsystems, which includes
a more advanced rework of the rudimentary reference counting introduced in
patch 2.

Patch #4 modifies the net_cls subsystem, which already had some module
declarations, to be configurable as a module, which also serves as a
simple proof-of-concept.

Part of implementing patches 2 and 4 involved updating css pointers in
each css_set when the module appears or leaves.  In doing this, it was
discovered that css_sets always remain linked to the dummy cgroup,
regardless of whether or not any subsystems are actually bound to it
(i.e., not mounted on an actual hierarchy).  The subsystem loading and
unloading code therefore should keep in mind the special cases where the
added subsystem is the only one in the dummy cgroup (and therefore all
css_sets need to be linked back into it) and where the removed subsys was
the only one in the dummy cgroup (and therefore all css_sets should be
unlinked from it) - however, as all css_sets always stay attached to the
dummy cgroup anyway, these cases are ignored.  Any fix that addresses this
issue should also make sure these cases are addressed in the subsystem
loading and unloading code.

This patch:

Make subsys[] able to be dynamically populated to support modular
subsystems

This patch reworks the way the subsys[] array is used so that subsystems
can register themselves after boot time, and enables the internals of
cgroups to be able to handle when subsystems are not present or may
appear/disappear.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ryncsn pushed a commit to ryncsn/jordan-kernel that referenced this pull request Mar 31, 2015
When the last CPU of a given leaf rcu_node structure goes
offline, all of the tasks queued on that leaf rcu_node structure
(due to having blocked in their current RCU read-side critical
sections) are requeued onto the root rcu_node structure.  This
requeuing is carried out by rcu_preempt_offline_tasks().
However, it is possible that these queued tasks are the only
thing preventing the leaf rcu_node structure from reporting a
quiescent state up the rcu_node hierarchy.  Unfortunately, the
old code would fail to do this reporting, resulting in a
grace-period stall given the following sequence of events:

1.	Kernel built for more than 32 CPUs on 32-bit systems or for more
	than 64 CPUs on 64-bit systems, so that there is more than one
	rcu_node structure.  (Or CONFIG_RCU_FANOUT is artificially set
	to a number smaller than CONFIG_NR_CPUS.)

2.	The kernel is built with CONFIG_TREE_PREEMPT_RCU.

3.	A task running on a CPU associated with a given leaf rcu_node
	structure blocks while in an RCU read-side critical section
	-and- that CPU has not yet passed through a quiescent state
	for the current RCU grace period.  This will cause the task
	to be queued on the leaf rcu_node's blocked_tasks[] array, in
	particular, on the element of this array corresponding to the
	current grace period.

4.	Each of the remaining CPUs corresponding to this same leaf rcu_node
	structure pass through a quiescent state.  However, the task is
	still in its RCU read-side critical section, so these quiescent
	states cannot be reported further up the rcu_node hierarchy.
	Nevertheless, all bits in the leaf rcu_node structure's ->qsmask
	field are now zero.

5.	Each of the remaining CPUs go offline.  (The events in step
	Quarx2k#4 and Quarx2k#5 can happen in any order as long as each CPU passes
	through a quiescent state before going offline.)

6.	When the last CPU goes offline, __rcu_offline_cpu() will invoke
	rcu_preempt_offline_tasks(), which will move the task to the
	root rcu_node structure, but without reporting a quiescent state
	up the rcu_node hierarchy (and this failure to report a quiescent
	state is the bug).

	But because this leaf rcu_node structure's ->qsmask field is
	already zero and its ->block_tasks[] entries are all empty,
	force_quiescent_state() will skip this rcu_node structure.

	Therefore, grace periods are now hung.

This patch abstracts some code out of rcu_read_unlock_special(),
calling the result task_quiet() by analogy with cpu_quiet(), and
invokes task_quiet() from both rcu_read_lock_special() and
__rcu_offline_cpu().  Invoking task_quiet() from
__rcu_offline_cpu() reports the quiescent state up the rcu_node
hierarchy, fixing the bug.  This ends up requiring a separate
lock_class_key per level of the rcu_node hierarchy, which this
patch also provides.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12589088301770-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
ryncsn pushed a commit to ryncsn/jordan-kernel that referenced this pull request Apr 18, 2015
When the last CPU of a given leaf rcu_node structure goes
offline, all of the tasks queued on that leaf rcu_node structure
(due to having blocked in their current RCU read-side critical
sections) are requeued onto the root rcu_node structure.  This
requeuing is carried out by rcu_preempt_offline_tasks().
However, it is possible that these queued tasks are the only
thing preventing the leaf rcu_node structure from reporting a
quiescent state up the rcu_node hierarchy.  Unfortunately, the
old code would fail to do this reporting, resulting in a
grace-period stall given the following sequence of events:

1.	Kernel built for more than 32 CPUs on 32-bit systems or for more
	than 64 CPUs on 64-bit systems, so that there is more than one
	rcu_node structure.  (Or CONFIG_RCU_FANOUT is artificially set
	to a number smaller than CONFIG_NR_CPUS.)

2.	The kernel is built with CONFIG_TREE_PREEMPT_RCU.

3.	A task running on a CPU associated with a given leaf rcu_node
	structure blocks while in an RCU read-side critical section
	-and- that CPU has not yet passed through a quiescent state
	for the current RCU grace period.  This will cause the task
	to be queued on the leaf rcu_node's blocked_tasks[] array, in
	particular, on the element of this array corresponding to the
	current grace period.

4.	Each of the remaining CPUs corresponding to this same leaf rcu_node
	structure pass through a quiescent state.  However, the task is
	still in its RCU read-side critical section, so these quiescent
	states cannot be reported further up the rcu_node hierarchy.
	Nevertheless, all bits in the leaf rcu_node structure's ->qsmask
	field are now zero.

5.	Each of the remaining CPUs go offline.  (The events in step
	Quarx2k#4 and Quarx2k#5 can happen in any order as long as each CPU passes
	through a quiescent state before going offline.)

6.	When the last CPU goes offline, __rcu_offline_cpu() will invoke
	rcu_preempt_offline_tasks(), which will move the task to the
	root rcu_node structure, but without reporting a quiescent state
	up the rcu_node hierarchy (and this failure to report a quiescent
	state is the bug).

	But because this leaf rcu_node structure's ->qsmask field is
	already zero and its ->block_tasks[] entries are all empty,
	force_quiescent_state() will skip this rcu_node structure.

	Therefore, grace periods are now hung.

This patch abstracts some code out of rcu_read_unlock_special(),
calling the result task_quiet() by analogy with cpu_quiet(), and
invokes task_quiet() from both rcu_read_lock_special() and
__rcu_offline_cpu().  Invoking task_quiet() from
__rcu_offline_cpu() reports the quiescent state up the rcu_node
hierarchy, fixing the bug.  This ends up requiring a separate
lock_class_key per level of the rcu_node hierarchy, which this
patch also provides.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12589088301770-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
ryncsn pushed a commit to ryncsn/jordan-kernel that referenced this pull request May 13, 2015
When the last CPU of a given leaf rcu_node structure goes
offline, all of the tasks queued on that leaf rcu_node structure
(due to having blocked in their current RCU read-side critical
sections) are requeued onto the root rcu_node structure.  This
requeuing is carried out by rcu_preempt_offline_tasks().
However, it is possible that these queued tasks are the only
thing preventing the leaf rcu_node structure from reporting a
quiescent state up the rcu_node hierarchy.  Unfortunately, the
old code would fail to do this reporting, resulting in a
grace-period stall given the following sequence of events:

1.	Kernel built for more than 32 CPUs on 32-bit systems or for more
	than 64 CPUs on 64-bit systems, so that there is more than one
	rcu_node structure.  (Or CONFIG_RCU_FANOUT is artificially set
	to a number smaller than CONFIG_NR_CPUS.)

2.	The kernel is built with CONFIG_TREE_PREEMPT_RCU.

3.	A task running on a CPU associated with a given leaf rcu_node
	structure blocks while in an RCU read-side critical section
	-and- that CPU has not yet passed through a quiescent state
	for the current RCU grace period.  This will cause the task
	to be queued on the leaf rcu_node's blocked_tasks[] array, in
	particular, on the element of this array corresponding to the
	current grace period.

4.	Each of the remaining CPUs corresponding to this same leaf rcu_node
	structure pass through a quiescent state.  However, the task is
	still in its RCU read-side critical section, so these quiescent
	states cannot be reported further up the rcu_node hierarchy.
	Nevertheless, all bits in the leaf rcu_node structure's ->qsmask
	field are now zero.

5.	Each of the remaining CPUs go offline.  (The events in step
	Quarx2k#4 and Quarx2k#5 can happen in any order as long as each CPU passes
	through a quiescent state before going offline.)

6.	When the last CPU goes offline, __rcu_offline_cpu() will invoke
	rcu_preempt_offline_tasks(), which will move the task to the
	root rcu_node structure, but without reporting a quiescent state
	up the rcu_node hierarchy (and this failure to report a quiescent
	state is the bug).

	But because this leaf rcu_node structure's ->qsmask field is
	already zero and its ->block_tasks[] entries are all empty,
	force_quiescent_state() will skip this rcu_node structure.

	Therefore, grace periods are now hung.

This patch abstracts some code out of rcu_read_unlock_special(),
calling the result task_quiet() by analogy with cpu_quiet(), and
invokes task_quiet() from both rcu_read_lock_special() and
__rcu_offline_cpu().  Invoking task_quiet() from
__rcu_offline_cpu() reports the quiescent state up the rcu_node
hierarchy, fixing the bug.  This ends up requiring a separate
lock_class_key per level of the rcu_node hierarchy, which this
patch also provides.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12589088301770-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
ryncsn pushed a commit to ryncsn/jordan-kernel that referenced this pull request Nov 18, 2015
This patch series provides the ability for cgroup subsystems to be
compiled as modules both within and outside the kernel tree.  This is
mainly useful for classifiers and subsystems that hook into components
that are already modules.  cls_cgroup and blkio-cgroup serve as the
example use cases for this feature.

It provides an interface cgroup_load_subsys() and cgroup_unload_subsys()
which modular subsystems can use to register and depart during runtime.
The net_cls classifier subsystem serves as the example for a subsystem
which can be converted into a module using these changes.

Patch Quarx2k#1 sets up the subsys[] array so its contents can be dynamic as
modules appear and (eventually) disappear.  Iterations over the array are
modified to handle when subsystems are absent, and the dynamic section of
the array is protected by cgroup_mutex.

Patch Quarx2k#2 implements an interface for modules to load subsystems, called
cgroup_load_subsys, similar to cgroup_init_subsys, and adds a module
pointer in struct cgroup_subsys.

Patch Quarx2k#3 adds a mechanism for unloading modular subsystems, which includes
a more advanced rework of the rudimentary reference counting introduced in
patch 2.

Patch Quarx2k#4 modifies the net_cls subsystem, which already had some module
declarations, to be configurable as a module, which also serves as a
simple proof-of-concept.

Part of implementing patches 2 and 4 involved updating css pointers in
each css_set when the module appears or leaves.  In doing this, it was
discovered that css_sets always remain linked to the dummy cgroup,
regardless of whether or not any subsystems are actually bound to it
(i.e., not mounted on an actual hierarchy).  The subsystem loading and
unloading code therefore should keep in mind the special cases where the
added subsystem is the only one in the dummy cgroup (and therefore all
css_sets need to be linked back into it) and where the removed subsys was
the only one in the dummy cgroup (and therefore all css_sets should be
unlinked from it) - however, as all css_sets always stay attached to the
dummy cgroup anyway, these cases are ignored.  Any fix that addresses this
issue should also make sure these cases are addressed in the subsystem
loading and unloading code.

This patch:

Make subsys[] able to be dynamically populated to support modular
subsystems

This patch reworks the way the subsys[] array is used so that subsystems
can register themselves after boot time, and enables the internals of
cgroups to be able to handle when subsystems are not present or may
appear/disappear.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants