2.6.32 #1

espaciosalter20 · 2012-10-07T16:35:36Z

I want to add some new governors to the kernel. Please check these.

added some new governors

Added new governors

Very rare kernel crashes are reported on a custom OMAP4 board. Kernel panics due to corrupted completion structure while executing dispc_irq_wait_handler(). Excerpt from kernel log:

    Internal error: Oops - undefined instruction: 0 [#1] PREEMPT SMP
    Unable to handle kernel paging request at virtual address 00400130
    ...
    PC is at 0xebf205bc
    LR is at __wake_up_common+0x54/0x94
    ...
    (__wake_up_common+0x0/0x94)
    (complete+0x0/0x60)
    (dispc_irq_wait_handler.36902+0x0/0x14)
    (omap_dispc_irq_handler+0x0/0x354)
    (handle_irq_event_percpu+0x0/0x188)
    (handle_irq_event+0x0/0x64)
    (handle_fasteoi_irq+0x0/0x10c)
    (generic_handle_irq+0x0/0x48)
    (asm_do_IRQ+0x0/0xc0)

DISPC IRQ executes callbacks with dispc.irq_lock released. Hence unregister_isr() and DISPC IRQ might be running in parallel on different CPUs. So there is a chance that a callback is executed even though it has been unregistered. As omap_dispc_wait_for_irq_timeout() declares a completion on stack, the dispc_irq_wait_handler() callback might try to access a completion structure that is invalid. This leads to crashes and hangs.

Solution is to divide unregister calls into two sets:

1. Non-strict unregistering of callbacks. Callbacks could safely be executed after unregistering them. This is the case with unregister calls from the IRQ handler itself.
2. Strict (synchronized) unregistering. Callbacks are not allowed after unregistering. This is the case with completion waiting.

The above solution should satisfy one of the original intentions of the driver: callbacks should be able to unregister themselves.

Change-Id: I9b88f5e07f146ae29649ca07b5e4764f35ccd738
Signed-off-by: Dimitar Dimitrov <dddimitrov@mm-sol.com>
Conflicts: drivers/video/omap2/dss/wb.c
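The two unregister flavours described in this commit message can be modelled in user space. The following is only a sketch under stated assumptions: `isr_table`, `isr_unregister_nosync()`, `isr_unregister_sync()` and the `in_flight` counter are hypothetical names, a C11 `atomic_flag` stands in for `dispc.irq_lock`, and none of this is the driver's actual code.

```c
#include <stdatomic.h>
#include <stddef.h>

#define MAX_ISR 8

typedef void (*isr_fn)(void *arg);

struct isr_slot {
    isr_fn fn;                 /* NULL means the slot is free */
    void *arg;
};

struct isr_table {
    atomic_flag lock;          /* hypothetical stand-in for dispc.irq_lock */
    atomic_int in_flight;      /* dispatch passes currently running callbacks */
    struct isr_slot slots[MAX_ISR];
};

static void table_lock(struct isr_table *t)
{
    while (atomic_flag_test_and_set(&t->lock))
        ;                      /* spin until the flag is released */
}

static void table_unlock(struct isr_table *t)
{
    atomic_flag_clear(&t->lock);
}

static void isr_table_init(struct isr_table *t)
{
    atomic_flag_clear(&t->lock);
    atomic_init(&t->in_flight, 0);
    for (size_t i = 0; i < MAX_ISR; i++)
        t->slots[i].fn = NULL;
}

static int isr_register(struct isr_table *t, isr_fn fn, void *arg)
{
    int ret = -1;

    table_lock(t);
    for (size_t i = 0; i < MAX_ISR; i++) {
        if (!t->slots[i].fn) {
            t->slots[i].fn = fn;
            t->slots[i].arg = arg;
            ret = 0;
            break;
        }
    }
    table_unlock(t);
    return ret;
}

/* Non-strict: a pass that already took its snapshot may still call fn.
 * Safe to call from inside a callback. */
static void isr_unregister_nosync(struct isr_table *t, isr_fn fn, void *arg)
{
    table_lock(t);
    for (size_t i = 0; i < MAX_ISR; i++)
        if (t->slots[i].fn == fn && t->slots[i].arg == arg)
            t->slots[i].fn = NULL;
    table_unlock(t);
}

/* Strict: returns only when no dispatch pass is running, so arg may be
 * freed or go out of scope afterwards. Must not be called from a callback,
 * because it would wait for its own pass to finish. */
static void isr_unregister_sync(struct isr_table *t, isr_fn fn, void *arg)
{
    isr_unregister_nosync(t, fn, arg);
    while (atomic_load(&t->in_flight) != 0)
        ;                      /* wait for passes holding a stale snapshot */
}

/* Models the IRQ handler: snapshot under the lock, run callbacks unlocked. */
static void isr_dispatch(struct isr_table *t)
{
    struct isr_slot snap[MAX_ISR];

    table_lock(t);
    atomic_fetch_add(&t->in_flight, 1);
    for (size_t i = 0; i < MAX_ISR; i++)
        snap[i] = t->slots[i];
    table_unlock(t);

    for (size_t i = 0; i < MAX_ISR; i++)
        if (snap[i].fn)
            snap[i].fn(snap[i].arg);

    atomic_fetch_sub(&t->in_flight, 1);
}

/* Example callback: counts how often it was invoked. */
static void count_cb(void *arg)
{
    ++*(int *)arg;
}
```

A waiter that keeps its completion on the stack would use the strict variant before returning, while a callback that removes itself would use the non-strict one.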


…uring migration by not migrating temporary stacks

Page migration requires rmap to be able to find all ptes mapping a page at all times, otherwise the migration entry can be instantiated, but it is possible to leave one behind if the second rmap_walk fails to find the page. If this page is later faulted, migration_entry_to_page() will call BUG because the page is locked, indicating the page was migrated but the migration PTE was not cleaned up. For example:

    kernel BUG at include/linux/swapops.h:105!
    invalid opcode: 0000 [#1] PREEMPT SMP
    ...
    Call Trace:
     [<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a
     [<ffffffff8130c7a2>] do_page_fault+0x44a/0x46e
     [<ffffffff813099b5>] page_fault+0x25/0x30
     [<ffffffff8114de33>] load_elf_binary+0x152a/0x192b
     [<ffffffff8111329b>] search_binary_handler+0x173/0x313
     [<ffffffff81114896>] do_execve+0x219/0x30a
     [<ffffffff8100a5c6>] sys_execve+0x43/0x5e
     [<ffffffff8100320a>] stub_execve+0x6a/0xc0
    RIP [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129

There is a race between shift_arg_pages and migration that triggers this bug. A temporary stack is set up during exec and later moved. If migration moves a page in the temporary stack and the VMA is then removed before migration completes, the migration PTE may not be found, leading to a BUG when the stack is faulted.

This patch causes pages within the temporary stack during exec to be skipped by migration. It does this by marking the VMA covering the temporary stack with an otherwise impossible combination of VMA flags. These flags are cleared when the temporary stack is moved to its final location.

[kamezawa.hiroyu@jp.fujitsu.com: idea for having migration skip temporary stacks]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts: fs/exec.c
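The idea of marking a VMA with an "otherwise impossible" flag combination can be illustrated with a small user-space model. This is a sketch, not the kernel code: the flag names and bit values are hypothetical, and the choice of the sequential-read and random-read hints as the mutually exclusive pair is an assumption made for illustration, since the message above does not name the flags.

```c
#include <stdbool.h>

/* Hypothetical flag bits for illustration; real kernel values differ. */
#define VMF_GROWSDOWN 0x0100UL   /* mapping behaves like a stack */
#define VMF_SEQ_READ  0x8000UL   /* hint: sequential access */
#define VMF_RAND_READ 0x10000UL  /* hint: random access */

/* Assumption: the two access hints contradict each other, so a normal
 * mapping never carries both. Both set at once acts as the marker. */
#define VMF_STACK_INCOMPLETE_SETUP (VMF_SEQ_READ | VMF_RAND_READ)

struct vma_model {
    unsigned long flags;
};

/* True only for a stack-like mapping that still carries the marker. */
static bool vma_is_temporary_stack(const struct vma_model *v)
{
    if (!(v->flags & VMF_GROWSDOWN))
        return false;
    return (v->flags & VMF_STACK_INCOMPLETE_SETUP) ==
           VMF_STACK_INCOMPLETE_SETUP;
}

/* Called once the stack has been moved to its final location. */
static void vma_finish_stack_setup(struct vma_model *v)
{
    v->flags &= ~VMF_STACK_INCOMPLETE_SETUP;
}

/* Migration consults this before touching pages in the mapping. */
static bool migration_should_skip(const struct vma_model *v)
{
    return vma_is_temporary_stack(v);
}
```

Reusing an impossible combination avoids spending a new flag bit on a state that only exists briefly during exec.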

dump_tasks() needs to hold the RCU read lock around its access of the target task's UID. To this end it should use task_uid() as it only needs that one thing from the creds.

The fact that dump_tasks() holds tasklist_lock is insufficient to prevent the target process replacing its credentials on another CPU.

This patch therefore calls rcu_read_lock() explicitly.

    ===================================================
    [ INFO: suspicious rcu_dereference_check() usage. ]
    ---------------------------------------------------
    mm/oom_kill.c:410 invoked rcu_dereference_check() without protection!

    other info that might help us debug this:

    rcu_scheduler_active = 1, debug_locks = 1
    4 locks held by kworker/1:2/651:
     #0: (events){+.+.+.}, at: [<ffffffff8106aae7>] process_one_work+0x137/0x4a0
     #1: (moom_work){+.+...}, at: [<ffffffff8106aae7>] process_one_work+0x137/0x4a0
     #2: (tasklist_lock){.+.+..}, at: [<ffffffff810fafd4>] out_of_memory+0x164/0x3f0
     #3: (&(&p->alloc_lock)->rlock){+.+...}, at: [<ffffffff810fa48e>] find_lock_task_mm+0x2e/0x70

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

MVFR0 and MVFR1 are only available starting with ARM1136 r1p0 release according to "B.5 VFP changes" in DDI0211F_arm1136_r1p0_trm.pdf. This is also when TLS register got added, so we can use HAS_TLS also to test for MVFR0 and MVFR1.

Otherwise VFPFMRX and VFPFMXR access fails and we get:

    Internal error: Oops - undefined instruction: 0 [#1]
    PC is at no_old_VFP_process+0x8/0x3c
    LR is at __und_svc+0x48/0x80
    ...

Signed-off-by: Tony Lindgren <tony@atomide.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

Errata Titles:
i103: Delay needed to read some GP timer, WD timer and sync timer registers after wakeup (OMAP3/4)
i767: Delay needed to read some GP timer registers after wakeup (OMAP5)

Description (i103/i767):
If a General Purpose Timer (GPTimer) is in posted mode (TSICR [2].POSTED=1), due to internal resynchronizations, values read in TCRR, TCAR1 and TCAR2 registers right after the timer interface clock (L4) goes from stopped to active may not return the expected values. The most common event leading to this situation occurs upon wake up from idle.

GPTimer non-posted synchronization mode is not impacted by this limitation.

Workarounds:
1). Disable posted mode
2). Use static dependency between timer clock domain and MPUSS clock domain
3). Use no-idle mode when the timer is active

Workarounds #2 and #3 are not practical from a power standpoint and so workaround #1 has been implemented. Disabling posted mode adds some CPU overhead for configuring and reading the timers as the CPU has to wait for accesses to be re-synchronised within the timer. However, disabling posted mode guarantees correct operation.

Please note that it is safe to use posted mode for timers if the counter (TCRR) and capture (TCARx) registers will never be read. An example of this is the clock-event system timer. This is used by the kernel to schedule events; however, the timer's counter is never read and capture registers are not used. Given that the kernel configures this timer often yet never reads the counter register, it is safe to enable posted mode in this case. Hence, for the timer used for kernel clock-events, posted mode is enabled by overriding the errata for devices that are impacted by this defect.

Drivers using the timers that do not read the counter or capture registers and wish to use posted mode can override the errata and enable posted mode by making the following function calls:

    __omap_dm_timer_override_errata(timer, OMAP_TIMER_ERRATA_I103_I767);
    __omap_dm_timer_enable_posted(timer);

Both dmtimers and watchdogs are impacted by this defect; this patch only implements the workaround for the dmtimer. Currently the watchdog driver does not read the counter register and so no workaround is necessary.

Posted mode will be disabled for all OMAP2+ devices (including AM33xx) using a GP timer as a clock-source timer to guarantee correct operation. This is not necessary for OMAP24xx devices, but the default clock-source timer for OMAP24xx devices is the 32k-sync timer and not the GP timer, and so this should not have any impact. This should be re-visited for future devices if this errata is fixed.

Confirmed with Vaibhav Hiremath that this bug also impacts AM33xx devices.

Signed-off-by: Jon Hunter <jon-hunter@ti.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Change-Id: Id10648050492d8c91ea22093127584f02ec3655b
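How an errata bit can gate posted mode, and how an override lifts that gate, might look like the following user-space model. It is a sketch under assumptions: `timer_model`, the helper names and the bit value are hypothetical and only mirror the intent of the two calls quoted in the message, not their real implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit value for the i103/i767 erratum. */
#define TIMER_ERRATA_I103_I767 0x1u

struct timer_model {
    uint32_t errata;   /* errata that currently apply to this timer */
    bool posted;       /* true when posted mode is in use */
};

/* Affected devices start with the erratum flagged and posted mode off. */
static void timer_model_init(struct timer_model *t, bool affected)
{
    t->errata = affected ? TIMER_ERRATA_I103_I767 : 0;
    t->posted = false;
}

/* A driver that never reads the counter or capture registers may clear
 * the erratum for its own timer. */
static void timer_model_override_errata(struct timer_model *t, uint32_t errata)
{
    t->errata &= ~errata;
}

/* Posted mode is refused while the erratum is still flagged. */
static void timer_model_enable_posted(struct timer_model *t)
{
    if (t->errata & TIMER_ERRATA_I103_I767) {
        t->posted = false;
        return;
    }
    t->posted = true;
}
```

Keeping the override explicit means the safe, non-posted behaviour stays the default and each opt-out is visible at the call site.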

When unbinding a device so that I could pass it through to a KVM VM, I got the lockdep report below. It looks like a legitimate lock ordering problem:

 - domain_context_mapping_one() takes iommu->lock and calls iommu_support_dev_iotlb(), which takes device_domain_lock (inside iommu->lock).

 - domain_remove_one_dev_info() starts by taking device_domain_lock then takes iommu->lock inside it (near the end of the function).

So this is the classic AB-BA deadlock. It looks like a safe fix is to simply release device_domain_lock a bit earlier, since as far as I can tell, it doesn't protect any of the stuff accessed at the end of domain_remove_one_dev_info() anyway.

BTW, the use of device_domain_lock looks a bit unsafe to me... it's at least not obvious to me why we aren't vulnerable to the race below:

    iommu_support_dev_iotlb()
                                      domain_remove_dev_info()

    lock device_domain_lock
      find info
    unlock device_domain_lock

                                      lock device_domain_lock
                                        find same info
                                      unlock device_domain_lock

                                      free_devinfo_mem(info)

    do stuff with info after it's free

However I don't understand the locking here well enough to know if this is a real problem, let alone what the best fix is.

Anyway here's the full lockdep output that prompted all of this:

    =======================================================
    [ INFO: possible circular locking dependency detected ]
    2.6.39.1+ #1
    -------------------------------------------------------
    bash/13954 is trying to acquire lock:
     (&(&iommu->lock)->rlock){......}, at: [<ffffffff812f6421>] domain_remove_one_dev_info+0x121/0x230

    but task is already holding lock:
     (device_domain_lock){-.-...}, at: [<ffffffff812f6508>] domain_remove_one_dev_info+0x208/0x230

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (device_domain_lock){-.-...}:
           [<ffffffff8109ca9d>] lock_acquire+0x9d/0x130
           [<ffffffff81571475>] _raw_spin_lock_irqsave+0x55/0xa0
           [<ffffffff812f8350>] domain_context_mapping_one+0x600/0x750
           [<ffffffff812f84df>] domain_context_mapping+0x3f/0x120
           [<ffffffff812f9175>] iommu_prepare_identity_map+0x1c5/0x1e0
           [<ffffffff81ccf1ca>] intel_iommu_init+0x88e/0xb5e
           [<ffffffff81cab204>] pci_iommu_init+0x16/0x41
           [<ffffffff81002165>] do_one_initcall+0x45/0x190
           [<ffffffff81ca3d3f>] kernel_init+0xe3/0x168
           [<ffffffff8157ac24>] kernel_thread_helper+0x4/0x10

    -> #0 (&(&iommu->lock)->rlock){......}:
           [<ffffffff8109bf3e>] __lock_acquire+0x195e/0x1e10
           [<ffffffff8109ca9d>] lock_acquire+0x9d/0x130
           [<ffffffff81571475>] _raw_spin_lock_irqsave+0x55/0xa0
           [<ffffffff812f6421>] domain_remove_one_dev_info+0x121/0x230
           [<ffffffff812f8b42>] device_notifier+0x72/0x90
           [<ffffffff8157555c>] notifier_call_chain+0x8c/0xc0
           [<ffffffff81089768>] __blocking_notifier_call_chain+0x78/0xb0
           [<ffffffff810897b6>] blocking_notifier_call_chain+0x16/0x20
           [<ffffffff81373a5c>] __device_release_driver+0xbc/0xe0
           [<ffffffff81373ccf>] device_release_driver+0x2f/0x50
           [<ffffffff81372ee3>] driver_unbind+0xa3/0xc0
           [<ffffffff813724ac>] drv_attr_store+0x2c/0x30
           [<ffffffff811e4506>] sysfs_write_file+0xe6/0x170
           [<ffffffff8117569e>] vfs_write+0xce/0x190
           [<ffffffff811759e4>] sys_write+0x54/0xa0
           [<ffffffff81579a82>] system_call_fastpath+0x16/0x1b

    other info that might help us debug this:

    6 locks held by bash/13954:
     #0: (&buffer->mutex){+.+.+.}, at: [<ffffffff811e4464>] sysfs_write_file+0x44/0x170
     #1: (s_active#3){++++.+}, at: [<ffffffff811e44ed>] sysfs_write_file+0xcd/0x170
     #2: (&__lockdep_no_validate__){+.+.+.}, at: [<ffffffff81372edb>] driver_unbind+0x9b/0xc0
     #3: (&__lockdep_no_validate__){+.+.+.}, at: [<ffffffff81373cc7>] device_release_driver+0x27/0x50
     #4: (&(&priv->bus_notifier)->rwsem){.+.+.+}, at: [<ffffffff8108974f>] __blocking_notifier_call_chain+0x5f/0xb0
     #5: (device_domain_lock){-.-...}, at: [<ffffffff812f6508>] domain_remove_one_dev_info+0x208/0x230

    stack backtrace:
    Pid: 13954, comm: bash Not tainted 2.6.39.1+ #1
    Call Trace:
     [<ffffffff810993a7>] print_circular_bug+0xf7/0x100
     [<ffffffff8109bf3e>] __lock_acquire+0x195e/0x1e10
     [<ffffffff810972bd>] ? trace_hardirqs_off+0xd/0x10
     [<ffffffff8109d57d>] ? trace_hardirqs_on_caller+0x13d/0x180
     [<ffffffff8109ca9d>] lock_acquire+0x9d/0x130
     [<ffffffff812f6421>] ? domain_remove_one_dev_info+0x121/0x230
     [<ffffffff81571475>] _raw_spin_lock_irqsave+0x55/0xa0
     [<ffffffff812f6421>] ? domain_remove_one_dev_info+0x121/0x230
     [<ffffffff810972bd>] ? trace_hardirqs_off+0xd/0x10
     [<ffffffff812f6421>] domain_remove_one_dev_info+0x121/0x230
     [<ffffffff812f8b42>] device_notifier+0x72/0x90
     [<ffffffff8157555c>] notifier_call_chain+0x8c/0xc0
     [<ffffffff81089768>] __blocking_notifier_call_chain+0x78/0xb0
     [<ffffffff810897b6>] blocking_notifier_call_chain+0x16/0x20
     [<ffffffff81373a5c>] __device_release_driver+0xbc/0xe0
     [<ffffffff81373ccf>] device_release_driver+0x2f/0x50
     [<ffffffff81372ee3>] driver_unbind+0xa3/0xc0
     [<ffffffff813724ac>] drv_attr_store+0x2c/0x30
     [<ffffffff811e4506>] sysfs_write_file+0xe6/0x170
     [<ffffffff8117569e>] vfs_write+0xce/0x190
     [<ffffffff811759e4>] sys_write+0x54/0xa0
     [<ffffffff81579a82>] system_call_fastpath+0x16/0x1b

Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
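The AB-BA pattern reported here can be reproduced with a toy lock-order checker. The sketch below is a heavily simplified illustration and not lockdep itself: `order_checker` and its helpers are hypothetical, it tracks a single task's held locks, and it detects only direct two-lock inversions rather than longer dependency cycles.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_CLASSES 8
#define MAX_HELD 8

/* Lock classes named after the two locks in the report above. */
enum { IOMMU_LOCK, DEVICE_DOMAIN_LOCK };

struct order_checker {
    bool before[MAX_CLASSES][MAX_CLASSES]; /* before[a][b]: a held when b taken */
    int held[MAX_HELD];                    /* classes currently held */
    int nheld;
};

static void checker_init(struct order_checker *c)
{
    memset(c, 0, sizeof(*c));
}

/* Records the new ordering edges and returns false when taking cls now
 * contradicts an order that was observed earlier. */
static bool checker_acquire(struct order_checker *c, int cls)
{
    bool ok = true;

    for (int i = 0; i < c->nheld; i++) {
        int h = c->held[i];

        if (c->before[cls][h])
            ok = false;             /* cls -> h was seen, now h -> cls */
        c->before[h][cls] = true;
    }
    if (c->nheld < MAX_HELD)
        c->held[c->nheld++] = cls;
    return ok;
}

/* Drops cls from the held set; the recorded ordering edges are kept. */
static void checker_release(struct order_checker *c, int cls)
{
    for (int i = 0; i < c->nheld; i++) {
        if (c->held[i] == cls) {
            c->held[i] = c->held[--c->nheld];
            return;
        }
    }
}
```

Replaying the two paths shows why the fix works: once the remove path releases `DEVICE_DOMAIN_LOCK` before taking `IOMMU_LOCK`, no reverse edge is ever recorded and the checker stays quiet.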

After the call to phy_init_hw failed in phy_attach_direct, phy_detach is called to detach the phy device from its network device. If the attached driver is a generic phy driver, this also detaches the driver. Subsequently phy_resume is called, which assumes without checking that a driver is attached to the device. This will result in a crash such as

    Unable to handle kernel paging request for data at address 0xffffffffffffff90
    Faulting instruction address: 0xc0000000003a0e18
    Oops: Kernel access of bad area, sig: 11 [#1]
    ...
    NIP [c0000000003a0e18] .phy_attach_direct+0x68/0x17c
    LR [c0000000003a0e6c] .phy_attach_direct+0xbc/0x17c
    Call Trace:
    [c0000003fc0475d0] [c0000000003a0e6c] .phy_attach_direct+0xbc/0x17c (unreliable)
    [c0000003fc047670] [c0000000003a0ff8] .phy_connect_direct+0x28/0x98
    [c0000003fc047700] [c0000000003f0074] .of_phy_connect+0x4c/0xa4

Only call phy_resume if phy_init_hw was successful.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
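The control flow of the fix, resuming only when hardware initialisation succeeded, can be shown with a small stand-alone model. This is a sketch under assumptions: `phy_model` and the `model_*` helpers are hypothetical stand-ins for the functions named in the message and do not reproduce their real behaviour.

```c
#include <stdbool.h>

struct phy_model {
    bool has_driver;   /* a driver is bound to the device */
    bool resumed;      /* resume ran successfully */
    int init_result;   /* value the simulated hardware init returns */
};

static int model_init_hw(struct phy_model *p)
{
    return p->init_result;
}

/* Detaching a generically driven device also drops its driver. */
static void model_detach(struct phy_model *p)
{
    p->has_driver = false;
}

/* Resume needs a driver; the model returns an error where the kernel
 * code dereferenced a missing driver. */
static int model_resume(struct phy_model *p)
{
    if (!p->has_driver)
        return -1;
    p->resumed = true;
    return 0;
}

/* Fixed ordering: detach on failure, resume only on success. */
static int model_attach(struct phy_model *p)
{
    int err;

    p->has_driver = true;
    err = model_init_hw(p);
    if (err)
        model_detach(p);
    else
        model_resume(p);
    return err;
}
```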

The following happens when trying to run a kvm guest on a kernel configured for 64k pages. This doesn't happen with 4k pages:

    BUG: failure at include/linux/mm.h:297/put_page_testzero()!
    Kernel panic - not syncing: BUG!
    CPU: 2 PID: 4228 Comm: qemu-system-aar Tainted: GF 3.13.0-0.rc7.31.sa2.k32v1.aarch64.debug #1
    Call trace:
    [<fffffe0000096034>] dump_backtrace+0x0/0x16c
    [<fffffe00000961b4>] show_stack+0x14/0x1c
    [<fffffe000066e648>] dump_stack+0x84/0xb0
    [<fffffe0000668678>] panic+0xf4/0x220
    [<fffffe000018ec78>] free_reserved_area+0x0/0x110
    [<fffffe000018edd8>] free_pages+0x50/0x88
    [<fffffe00000a759c>] kvm_free_stage2_pgd+0x30/0x40
    [<fffffe00000a5354>] kvm_arch_destroy_vm+0x18/0x44
    [<fffffe00000a1854>] kvm_put_kvm+0xf0/0x184
    [<fffffe00000a1938>] kvm_vm_release+0x10/0x1c
    [<fffffe00001edc1c>] __fput+0xb0/0x288
    [<fffffe00001ede4c>] ____fput+0xc/0x14
    [<fffffe00000d5a2c>] task_work_run+0xa8/0x11c
    [<fffffe0000095c14>] do_notify_resume+0x54/0x58

In arch/arm/kvm/mmu.c:unmap_range(), we end up doing an extra put_page() on the stage2 pgd which leads to the BUG in put_page_testzero(). This happens because a pud_huge() test in unmap_range() returns true when it should always be false with the 2-level page tables used by 64k pages. This patch removes support for huge puds if 2-level pagetables are being used.

Signed-off-by: Mark Salter <msalter@redhat.com>
[catalin.marinas@arm.com: removed #ifndef around PUD_SIZE check]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: <stable@vger.kernel.org> # v3.11+

Macvlan devices try to avoid stacking, but that's not always successful or even desired. As an example, the following configuration is perfectly legal and valid:

    eth0 <--- macvlan0 <---- vlan0.10 <--- macvlan1

However, this configuration produces the following lockdep trace:

    [ 115.620418] ======================================================
    [ 115.620477] [ INFO: possible circular locking dependency detected ]
    [ 115.620516] 3.15.0-rc1+ #24 Not tainted
    [ 115.620540] -------------------------------------------------------
    [ 115.620577] ip/1704 is trying to acquire lock:
    [ 115.620604]  (&vlan_netdev_addr_lock_key/1){+.....}, at: [<ffffffff815df49c>] dev_uc_sync+0x3c/0x80
    [ 115.620686] but task is already holding lock:
    [ 115.620723]  (&macvlan_netdev_addr_lock_key){+.....}, at: [<ffffffff815da5be>] dev_set_rx_mode+0x1e/0x40
    [ 115.620795] which lock already depends on the new lock.
    [ 115.620853] the existing dependency chain (in reverse order) is:
    [ 115.620894] -> #1 (&macvlan_netdev_addr_lock_key){+.....}:
    [ 115.620935]        [<ffffffff810d57f2>] lock_acquire+0xa2/0x130
    [ 115.620974]        [<ffffffff816f62e7>] _raw_spin_lock_nested+0x37/0x50
    [ 115.621019]        [<ffffffffa07296c3>] vlan_dev_set_rx_mode+0x53/0x110 [8021q]
    [ 115.621066]        [<ffffffff815da557>] __dev_set_rx_mode+0x57/0xa0
    [ 115.621105]        [<ffffffff815da5c6>] dev_set_rx_mode+0x26/0x40
    [ 115.621143]        [<ffffffff815da6be>] __dev_open+0xde/0x140
    [ 115.621174]        [<ffffffff815da9ad>] __dev_change_flags+0x9d/0x170
    [ 115.621174]        [<ffffffff815daaa9>] dev_change_flags+0x29/0x60
    [ 115.621174]        [<ffffffff815e7f11>] do_setlink+0x321/0x9a0
    [ 115.621174]        [<ffffffff815ea59f>] rtnl_newlink+0x51f/0x730
    [ 115.621174]        [<ffffffff815e6e75>] rtnetlink_rcv_msg+0x95/0x250
    [ 115.621174]        [<ffffffff81608b19>] netlink_rcv_skb+0xa9/0xc0
    [ 115.621174]        [<ffffffff815e6dca>] rtnetlink_rcv+0x2a/0x40
    [ 115.621174]        [<ffffffff81608150>] netlink_unicast+0xf0/0x1c0
    [ 115.621174]        [<ffffffff8160851f>] netlink_sendmsg+0x2ff/0x740
    [ 115.621174]        [<ffffffff815bc9db>] sock_sendmsg+0x8b/0xc0
    [ 115.621174]        [<ffffffff815bd4b9>] ___sys_sendmsg+0x369/0x380
    [ 115.621174]        [<ffffffff815bdbb2>] __sys_sendmsg+0x42/0x80
    [ 115.621174]        [<ffffffff815bdc02>] SyS_sendmsg+0x12/0x20
    [ 115.621174]        [<ffffffff816ffd69>] system_call_fastpath+0x16/0x1b
    [ 115.621174] -> #0 (&vlan_netdev_addr_lock_key/1){+.....}:
    [ 115.621174]        [<ffffffff810d4d43>] __lock_acquire+0x1773/0x1a60
    [ 115.621174]        [<ffffffff810d57f2>] lock_acquire+0xa2/0x130
    [ 115.621174]        [<ffffffff816f62e7>] _raw_spin_lock_nested+0x37/0x50
    [ 115.621174]        [<ffffffff815df49c>] dev_uc_sync+0x3c/0x80
    [ 115.621174]        [<ffffffffa0696d2a>] macvlan_set_mac_lists+0xca/0x110 [macvlan]
    [ 115.621174]        [<ffffffff815da557>] __dev_set_rx_mode+0x57/0xa0
    [ 115.621174]        [<ffffffff815da5c6>] dev_set_rx_mode+0x26/0x40
    [ 115.621174]        [<ffffffff815da6be>] __dev_open+0xde/0x140
    [ 115.621174]        [<ffffffff815da9ad>] __dev_change_flags+0x9d/0x170
    [ 115.621174]        [<ffffffff815daaa9>] dev_change_flags+0x29/0x60
    [ 115.621174]        [<ffffffff815e7f11>] do_setlink+0x321/0x9a0
    [ 115.621174]        [<ffffffff815ea59f>] rtnl_newlink+0x51f/0x730
    [ 115.621174]        [<ffffffff815e6e75>] rtnetlink_rcv_msg+0x95/0x250
    [ 115.621174]        [<ffffffff81608b19>] netlink_rcv_skb+0xa9/0xc0
    [ 115.621174]        [<ffffffff815e6dca>] rtnetlink_rcv+0x2a/0x40
    [ 115.621174]        [<ffffffff81608150>] netlink_unicast+0xf0/0x1c0
    [ 115.621174]        [<ffffffff8160851f>] netlink_sendmsg+0x2ff/0x740
    [ 115.621174]        [<ffffffff815bc9db>] sock_sendmsg+0x8b/0xc0
    [ 115.621174]        [<ffffffff815bd4b9>] ___sys_sendmsg+0x369/0x380
    [ 115.621174]        [<ffffffff815bdbb2>] __sys_sendmsg+0x42/0x80
    [ 115.621174]        [<ffffffff815bdc02>] SyS_sendmsg+0x12/0x20
    [ 115.621174]        [<ffffffff816ffd69>] system_call_fastpath+0x16/0x1b
    [ 115.621174] other info that might help us debug this:
    [ 115.621174]  Possible unsafe locking scenario:
    [ 115.621174]        CPU0                    CPU1
    [ 115.621174]        ----                    ----
    [ 115.621174]   lock(&macvlan_netdev_addr_lock_key);
    [ 115.621174]                                lock(&vlan_netdev_addr_lock_key/1);
    [ 115.621174]                                lock(&macvlan_netdev_addr_lock_key);
    [ 115.621174]   lock(&vlan_netdev_addr_lock_key/1);
    [ 115.621174]  *** DEADLOCK ***
    [ 115.621174] 2 locks held by ip/1704:
    [ 115.621174]  #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff815e6dbb>] rtnetlink_rcv+0x1b/0x40
    [ 115.621174]  #1: (&macvlan_netdev_addr_lock_key){+.....}, at: [<ffffffff815da5be>] dev_set_rx_mode+0x1e/0x40
    [ 115.621174] stack backtrace:
    [ 115.621174] CPU: 3 PID: 1704 Comm: ip Not tainted 3.15.0-rc1+ #24
    [ 115.621174] Hardware name: Hewlett-Packard HP xw8400 Workstation/0A08h, BIOS 786D5 v02.38 10/25/2010
    [ 115.621174]  ffffffff82339ae0 ffff880465f79568 ffffffff816ee20c ffffffff82339ae0
    [ 115.621174]  ffff880465f795a8 ffffffff816e9e1b ffff880465f79600 ffff880465b019c8
    [ 115.621174]  0000000000000001 0000000000000002 ffff880465b019c8 ffff880465b01230
    [ 115.621174] Call Trace:
    [ 115.621174]  [<ffffffff816ee20c>] dump_stack+0x4d/0x66
    [ 115.621174]  [<ffffffff816e9e1b>] print_circular_bug+0x200/0x20e
    [ 115.621174]  [<ffffffff810d4d43>] __lock_acquire+0x1773/0x1a60
    [ 115.621174]  [<ffffffff810d3172>] ? trace_hardirqs_on_caller+0xb2/0x1d0
    [ 115.621174]  [<ffffffff810d57f2>] lock_acquire+0xa2/0x130
    [ 115.621174]  [<ffffffff815df49c>] ? dev_uc_sync+0x3c/0x80
    [ 115.621174]  [<ffffffff816f62e7>] _raw_spin_lock_nested+0x37/0x50
    [ 115.621174]  [<ffffffff815df49c>] ? dev_uc_sync+0x3c/0x80
    [ 115.621174]  [<ffffffff815df49c>] dev_uc_sync+0x3c/0x80
    [ 115.621174]  [<ffffffffa0696d2a>] macvlan_set_mac_lists+0xca/0x110 [macvlan]
    [ 115.621174]  [<ffffffff815da557>] __dev_set_rx_mode+0x57/0xa0
    [ 115.621174]  [<ffffffff815da5c6>] dev_set_rx_mode+0x26/0x40
    [ 115.621174]  [<ffffffff815da6be>] __dev_open+0xde/0x140
    [ 115.621174]  [<ffffffff815da9ad>] __dev_change_flags+0x9d/0x170
    [ 115.621174]  [<ffffffff815daaa9>] dev_change_flags+0x29/0x60
    [ 115.621174]  [<ffffffff811e1db1>] ? mem_cgroup_bad_page_check+0x21/0x30
    [ 115.621174]  [<ffffffff815e7f11>] do_setlink+0x321/0x9a0
    [ 115.621174]  [<ffffffff810d394c>] ? __lock_acquire+0x37c/0x1a60
    [ 115.621174]  [<ffffffff815ea59f>] rtnl_newlink+0x51f/0x730
    [ 115.621174]  [<ffffffff815ea169>] ? rtnl_newlink+0xe9/0x730
    [ 115.621174]  [<ffffffff815e6e75>] rtnetlink_rcv_msg+0x95/0x250
    [ 115.621174]  [<ffffffff810d329d>] ? trace_hardirqs_on+0xd/0x10
    [ 115.621174]  [<ffffffff815e6dbb>] ? rtnetlink_rcv+0x1b/0x40
    [ 115.621174]  [<ffffffff815e6de0>] ? rtnetlink_rcv+0x40/0x40
    [ 115.621174]  [<ffffffff81608b19>] netlink_rcv_skb+0xa9/0xc0
    [ 115.621174]  [<ffffffff815e6dca>] rtnetlink_rcv+0x2a/0x40
    [ 115.621174]  [<ffffffff81608150>] netlink_unicast+0xf0/0x1c0
    [ 115.621174]  [<ffffffff8160851f>] netlink_sendmsg+0x2ff/0x740
    [ 115.621174]  [<ffffffff815bc9db>] sock_sendmsg+0x8b/0xc0
    [ 115.621174]  [<ffffffff8119d4af>] ? might_fault+0x5f/0xb0
    [ 115.621174]  [<ffffffff8119d4f8>] ? might_fault+0xa8/0xb0
    [ 115.621174]  [<ffffffff8119d4af>] ? might_fault+0x5f/0xb0
    [ 115.621174]  [<ffffffff815cb51e>] ? verify_iovec+0x5e/0xe0
    [ 115.621174]  [<ffffffff815bd4b9>] ___sys_sendmsg+0x369/0x380
    [ 115.621174]  [<ffffffff816faa0d>] ? __do_page_fault+0x11d/0x570
    [ 115.621174]  [<ffffffff810cfe9f>] ? up_read+0x1f/0x40
    [ 115.621174]  [<ffffffff816fab04>] ? __do_page_fault+0x214/0x570
    [ 115.621174]  [<ffffffff8120a10b>] ? mntput_no_expire+0x6b/0x1c0
    [ 115.621174]  [<ffffffff8120a0b7>] ? mntput_no_expire+0x17/0x1c0
    [ 115.621174]  [<ffffffff8120a284>] ? mntput+0x24/0x40
    [ 115.621174]  [<ffffffff815bdbb2>] __sys_sendmsg+0x42/0x80
    [ 115.621174]  [<ffffffff815bdc02>] SyS_sendmsg+0x12/0x20
    [ 115.621174]  [<ffffffff816ffd69>] system_call_fastpath+0x16/0x1b

Fix this by correctly providing macvlan lockdep class.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

When running send, if an inode only has extended reference items associated to it and no regular references, send.c:get_first_ref() was incorrectly assuming the reference it found was of type BTRFS_INODE_REF_KEY due to use of the wrong key variable. This caused weird behaviour when using the found item as a regular reference, such as a weird path string, and occasionally (when lucky) a crash:

    [ 190.600652] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
    [ 190.600994] Modules linked in: btrfs xor raid6_pq binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc psmouse serio_raw evbug pcspkr i2c_piix4 e1000 floppy
    [ 190.602565] CPU: 2 PID: 14520 Comm: btrfs Not tainted 3.13.0-fdm-btrfs-next-26+ #1
    [ 190.602728] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    [ 190.602868] task: ffff8800d447c920 ti: ffff8801fa79e000 task.ti: ffff8801fa79e000
    [ 190.603030] RIP: 0010:[<ffffffff813266b4>] [<ffffffff813266b4>] memcpy+0x54/0x110
    [ 190.603262] RSP: 0018:ffff8801fa79f880 EFLAGS: 00010202
    [ 190.603395] RAX: ffff8800d4326e3f RBX: 000000000000036a RCX: ffff880000000000
    [ 190.603553] RDX: 000000000000032a RSI: ffe708844042936a RDI: ffff8800d43271a9
    [ 190.603710] RBP: ffff8801fa79f8c8 R08: 00000000003a4ef0 R09: 0000000000000000
    [ 190.603867] R10: 793a4ef09f000000 R11: 9f0000000053726f R12: ffff8800d43271a9
    [ 190.604020] R13: 0000160000000000 R14: ffff8802110134f0 R15: 000000000000036a
    [ 190.604020] FS: 00007fb423d09b80(0000) GS:ffff880216200000(0000) knlGS:0000000000000000
    [ 190.604020] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 190.604020] CR2: 00007fb4229d4b78 CR3: 00000001f5d76000 CR4: 00000000000006e0
    [ 190.604020] Stack:
    [ 190.604020]  ffffffffa01f4d49 ffff8801fa79f8f0 00000000000009f9 ffff8801fa79f8c8
    [ 190.604020]  00000000000009f9 ffff880211013260 000000000000f971 ffff88021147dba8
    [ 190.604020]  00000000000009f9 ffff8801fa79f918 ffffffffa02367f5 ffff8801fa79f928
    [ 190.604020] Call Trace:
    [ 190.604020]  [<ffffffffa01f4d49>] ? read_extent_buffer+0xb9/0x120 [btrfs]
    [ 190.604020]  [<ffffffffa02367f5>] fs_path_add_from_extent_buffer+0x45/0x60 [btrfs]
    [ 190.604020]  [<ffffffffa0238806>] get_first_ref+0x1f6/0x210 [btrfs]
    [ 190.604020]  [<ffffffffa0238994>] __get_cur_name_and_parent+0x174/0x3a0 [btrfs]
    [ 190.604020]  [<ffffffff8118df3d>] ? kmem_cache_alloc_trace+0x11d/0x1e0
    [ 190.604020]  [<ffffffffa0236674>] ? fs_path_alloc+0x24/0x60 [btrfs]
    [ 190.604020]  [<ffffffffa0238c91>] get_cur_path+0xd1/0x240 [btrfs]
    (...)

Steps to reproduce (either crash or some weirdness like an odd path string):

    mkfs.btrfs -f -O extref /dev/sdd
    mount /dev/sdd /mnt

    mkdir /mnt/testdir
    touch /mnt/testdir/foobar

    for i in `seq 1 2550`; do
        ln /mnt/testdir/foobar /mnt/testdir/foobar_link_`printf "%04d" $i`
    done

    ln /mnt/testdir/foobar /mnt/testdir/final_foobar_name

    rm -f /mnt/testdir/foobar
    for i in `seq 1 2550`; do
        rm -f /mnt/testdir/foobar_link_`printf "%04d" $i`
    done

    btrfs subvolume snapshot -r /mnt /mnt/mysnap
    btrfs send /mnt/mysnap -f /tmp/mysnap.send

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

Kelly reported the following crash:

    IP: [<ffffffff817a993d>] tcf_action_exec+0x46/0x90
    PGD 3009067 PUD 300c067 PMD 11ff30067 PTE 800000011634b060
    Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
    CPU: 1 PID: 639 Comm: dhclient Not tainted 3.15.0-rc4+ #342
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff8801169ecd00 ti: ffff8800d21b8000 task.ti: ffff8800d21b8000
    RIP: 0010:[<ffffffff817a993d>] [<ffffffff817a993d>] tcf_action_exec+0x46/0x90
    RSP: 0018:ffff8800d21b9b90 EFLAGS: 00010283
    RAX: 00000000ffffffff RBX: ffff88011634b8e8 RCX: ffff8800cf7133d8
    RDX: ffff88011634b900 RSI: ffff8800cf7133e0 RDI: ffff8800d210f840
    RBP: ffff8800d21b9bb0 R08: ffffffff8287bf60 R09: 0000000000000001
    R10: ffff8800d2b22b24 R11: 0000000000000001 R12: ffff8800d210f840
    R13: ffff8800d21b9c50 R14: ffff8800cf7133e0 R15: ffff8800cad433d8
    FS: 00007f49723e1840(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff88011634b8f0 CR3: 00000000ce469000 CR4: 00000000000006e0
    Stack:
     ffff8800d2170188 ffff8800d210f840 ffff8800d2171b90 0000000000000000
     ffff8800d21b9be8 ffffffff817c55bb ffff8800d21b9c50 ffff8800d2171b90
     ffff8800d210f840 ffff8800d21b0300 ffff8800d21b9c50 ffff8800d21b9c18
    Call Trace:
     [<ffffffff817c55bb>] tcindex_classify+0x88/0x9b
     [<ffffffff817a7f7d>] tc_classify_compat+0x3e/0x7b
     [<ffffffff817a7fdf>] tc_classify+0x25/0x9f
     [<ffffffff817b0e68>] htb_enqueue+0x55/0x27a
     [<ffffffff817b6c2e>] dsmark_enqueue+0x165/0x1a4
     [<ffffffff81775642>] __dev_queue_xmit+0x35e/0x536
     [<ffffffff8177582a>] dev_queue_xmit+0x10/0x12
     [<ffffffff818f8ecd>] packet_sendmsg+0xb26/0xb9a
     [<ffffffff810b1507>] ? __lock_acquire+0x3ae/0xdf3
     [<ffffffff8175cf08>] __sock_sendmsg_nosec+0x25/0x27
     [<ffffffff8175d916>] sock_aio_write+0xd0/0xe7
     [<ffffffff8117d6b8>] do_sync_write+0x59/0x78
     [<ffffffff8117d84d>] vfs_write+0xb5/0x10a
     [<ffffffff8117d96a>] SyS_write+0x49/0x7f
     [<ffffffff8198e212>] system_call_fastpath+0x16/0x1b

This is because we memcpy struct tcindex_filter_result, which contains struct tcf_exts; obviously struct list_head cannot simply be copied. This is a regression introduced by commit 33be627 (net_sched: act: use standard struct list_head).

It's not very easy to fix as the code is a mess:

    if (old_r)
            memcpy(&cr, r, sizeof(cr));
    else {
            memset(&cr, 0, sizeof(cr));
            tcf_exts_init(&cr.exts, TCA_TCINDEX_ACT, TCA_TCINDEX_POLICE);
    }
    ...
    tcf_exts_change(tp, &cr.exts, &e);
    ...
    memcpy(r, &cr, sizeof(cr));

After this change the above code should be equivalent to:

    tcindex_filter_result_init(&cr);
    if (old_r)
            cr.res = r->res;
    ...
    if (old_r)
            tcf_exts_change(tp, &r->exts, &e);
    else
            tcf_exts_change(tp, &cr.exts, &e);
    ...
    r->res = cr.res;

since there is no need to copy struct tcf_exts.

It also fixes other places that zero structs containing struct tcf_exts.

Fixes: commit 33be627 (net_sched: act: use standard struct list_head)
Reported-by: Kelly Anderson <kelly@xilka.com>
Tested-by: Kelly Anderson <kelly@xilka.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
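Why a struct that embeds a circular list head cannot be copied with memcpy can be shown in a few lines of user-space C. The sketch below is an illustration only: `exts_model`, `filter_result_model` and the helper names are hypothetical, and the minimal `list_head` merely mimics the shape of the kernel type.

```c
#include <stdbool.h>
#include <string.h>

/* Minimal circular doubly linked list head, shaped like the kernel type. */
struct list_head {
    struct list_head *next, *prev;
};

/* An empty list points at itself. */
static void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

static bool list_is_empty(const struct list_head *h)
{
    return h->next == h;
}

/* Hypothetical stand-ins for the structures named in the message. */
struct exts_model {
    struct list_head actions;
};

struct filter_result_model {
    int classid;
    struct exts_model exts;
};

static void filter_result_init(struct filter_result_model *r)
{
    r->classid = 0;
    init_list_head(&r->exts.actions);
}

/* Broken: the copied head still points into the source object. */
static void copy_by_memcpy(struct filter_result_model *dst,
                           const struct filter_result_model *src)
{
    memcpy(dst, src, sizeof(*dst));
}

/* Safer: initialise the destination and copy only plain-data fields. */
static void copy_plain_fields(struct filter_result_model *dst,
                              const struct filter_result_model *src)
{
    filter_result_init(dst);
    dst->classid = src->classid;
}
```

After `copy_by_memcpy()` the destination's list head points at the source's head instead of at itself, so an empty list no longer looks empty and any traversal walks into memory owned by the other object.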

Commit 284f39a ("mm: memcg: push !mm handling out to page cache charge function") explicitly checks for page cache charges without any mm context (from kernel thread context[1]). This seemed to be the only possible case where memory could be charged without mm context so commit 03583f1 ("memcg: remove unnecessary !mm check from try_get_mem_cgroup_from_mm()") removed the mm check from get_mem_cgroup_from_mm(). This however caused another NULL ptr dereference during early boot when loopback kernel thread splices to tmpfs as reported by Stephan Kulow: BUG: unable to handle kernel NULL pointer dereference at 0000000000000360 IP: get_mem_cgroup_from_mm.isra.42+0x2b/0x60 Oops: 0000 [#1] SMP Modules linked in: btrfs dm_multipath dm_mod scsi_dh multipath raid10 raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 md_mod parport_pc parport nls_utf8 isofs usb_storage iscsi_ibft iscsi_boot_sysfs arc4 ecb fan thermal nfs lockd fscache nls_iso8859_1 nls_cp437 sg st hid_generic usbhid af_packet sunrpc sr_mod cdrom ata_generic uhci_hcd virtio_net virtio_blk ehci_hcd usbcore ata_piix floppy processor button usb_common virtio_pci virtio_ring virtio edd squashfs loop ppa] CPU: 0 PID: 97 Comm: loop1 Not tainted 3.15.0-rc5-5-default #1 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: __mem_cgroup_try_charge_swapin+0x40/0xe0 mem_cgroup_charge_file+0x8b/0xd0 shmem_getpage_gfp+0x66b/0x7b0 shmem_file_splice_read+0x18f/0x430 splice_direct_to_actor+0xa2/0x1c0 do_lo_receive+0x5a/0x60 [loop] loop_thread+0x298/0x720 [loop] kthread+0xc6/0xe0 ret_from_fork+0x7c/0xb0 Also Branimir Maksimovic reported the following oops which is triggered for the swapcache charge path from the accounting code for kernel threads: CPU: 1 PID: 160 Comm: kworker/u8:5 Tainted: P OE 3.15.0-rc5-core2-custom #159 Hardware name: System manufacturer System Product Name/MAXIMUSV GENE, BIOS 1903 08/19/2013 task: ffff880404e349b0 ti: ffff88040486a000 task.ti: ffff88040486a000 
RIP: get_mem_cgroup_from_mm.isra.42+0x2b/0x60 Call Trace: __mem_cgroup_try_charge_swapin+0x45/0xf0 mem_cgroup_charge_file+0x9c/0xe0 shmem_getpage_gfp+0x62c/0x770 shmem_write_begin+0x38/0x40 generic_perform_write+0xc5/0x1c0 __generic_file_aio_write+0x1d1/0x3f0 generic_file_aio_write+0x4f/0xc0 do_sync_write+0x5a/0x90 do_acct_process+0x4b1/0x550 acct_process+0x6d/0xa0 do_exit+0x827/0xa70 kthread+0xc3/0xf0 This patch fixes the issue by reintroducing the mm check into get_mem_cgroup_from_mm. We could do the same trick in __mem_cgroup_try_charge_swapin as we do for the regular page cache path but it is not worth the trouble. The check is not that expensive and it is better to have get_mem_cgroup_from_mm more robust. [1] - http://marc.info/?l=linux-mm&m=139463617808941&w=2 Fixes: 03583f1 ("memcg: remove unnecessary !mm check from try_get_mem_cgroup_from_mm()") Reported-and-tested-by: Stephan Kulow <coolo@suse.com> Reported-by: Branimir Maksimovic <branimir.maksimovic@gmail.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
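The shape of a reintroduced NULL check can be sketched in plain C. This is only an illustration under stated assumptions: the types and names (`group_demo`, `mm_demo`, `root_group_demo`, `get_group_from_mm_demo`) are hypothetical, and falling back to a root group when there is no mm context is an assumption of this sketch that the message above does not spell out.

```c
#include <stddef.h>

/* Hypothetical stand-ins; these are not the kernel's memcg types. */
struct group_demo {
    int id;
};

struct mm_demo {
    struct group_demo *owner_group;
};

/* Assumed fallback target for callers that have no mm context. */
struct group_demo root_group_demo = { 0 };

/*
 * Look up the group that should be charged. Kernel threads have no mm,
 * so the lookup tolerates mm == NULL instead of dereferencing it.
 */
struct group_demo *get_group_from_mm_demo(const struct mm_demo *mm)
{
    if (mm == NULL || mm->owner_group == NULL)
        return &root_group_demo;
    return mm->owner_group;
}
```

Keeping the check inside the lookup helper, rather than in each caller, is the design choice the message argues for: the test is cheap and every charge path is covered at once.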

…ssion() While running stress tests on adding and deleting ftrace instances I hit this bug: BUG: unable to handle kernel NULL pointer dereference at 0000000000000020 IP: selinux_inode_permission+0x85/0x160 PGD 63681067 PUD 7ddbe067 PMD 0 Oops: 0000 [#1] PREEMPT CPU: 0 PID: 5634 Comm: ftrace-test-mki Not tainted 3.13.0-rc4-test-00033-gd2a6dde-dirty #20 Hardware name: /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006 task: ffff880078375800 ti: ffff88007ddb0000 task.ti: ffff88007ddb0000 RIP: 0010:[<ffffffff812d8bc5>] [<ffffffff812d8bc5>] selinux_inode_permission+0x85/0x160 RSP: 0018:ffff88007ddb1c48 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000800000 RCX: ffff88006dd43840 RDX: 0000000000000001 RSI: 0000000000000081 RDI: ffff88006ee46000 RBP: ffff88007ddb1c88 R08: 0000000000000000 R09: ffff88007ddb1c54 R10: 6e6576652f6f6f66 R11: 0000000000000003 R12: 0000000000000000 R13: 0000000000000081 R14: ffff88006ee46000 R15: 0000000000000000 FS: 00007f217b5b6700(0000) GS:ffffffff81e21000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000020 CR3: 000000006a0fe000 CR4: 00000000000007f0 Call Trace: security_inode_permission+0x1c/0x30 __inode_permission+0x41/0xa0 inode_permission+0x18/0x50 link_path_walk+0x66/0x920 path_openat+0xa6/0x6c0 do_filp_open+0x43/0xa0 do_sys_open+0x146/0x240 SyS_open+0x1e/0x20 system_call_fastpath+0x16/0x1b Code: 84 a1 00 00 00 81 e3 00 20 00 00 89 d8 83 c8 02 40 f6 c6 04 0f 45 d8 40 f6 c6 08 74 71 80 cf 02 49 8b 46 38 4c 8d 4d cc 45 31 c0 <0f> b7 50 20 8b 70 1c 48 8b 41 70 89 d9 8b 78 04 e8 36 cf ff ff RIP selinux_inode_permission+0x85/0x160 CR2: 0000000000000020 Investigating, I found that the inode->i_security was NULL, and the dereference of it caused the oops. in selinux_inode_permission(): isec = inode->i_security; rc = avc_has_perm_noaudit(sid, isec->sid, isec->sclass, perms, 0, &avd); Note, the crash came from stressing the deletion and reading of debugfs files. 
I was not able to recreate this via normal files. But I'm not sure they are safe. It may just be that the race window is much harder to hit. What seems to have happened (and what I have traced) is that the file is being opened at the same time the file or directory is being deleted. As the dentry and inode locks are not held during the path walk, nor are the inode's ref counts being incremented, there is nothing saving these structures from being discarded except for an rcu_read_lock(). The rcu_read_lock() protects against freeing of the inode, but it does not protect against freeing of the inode_security_struct. Now if the freeing of the i_security happens with a call_rcu(), and the i_security field of the inode is not changed (it gets freed as the inode gets freed) then there will be no issue here. (Linus Torvalds suggested not setting the field to NULL such that we do not need to check if it is NULL in the permission check). Note, this is a hack, but it fixes the problem at hand. A real fix is to restructure the destroy_inode() to call all the destructor handlers from the RCU callback. But that is a major job to do, and requires a lot of work. For now, we just band-aid this bug with this fix (it works), and work on a more maintainable solution in the future. Link: http://lkml.kernel.org/r/20140109101932.0508dec7@gandalf.local.home Link: http://lkml.kernel.org/r/20140109182756.17abaaa8@gandalf.local.home Cc: stable@vger.kernel.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Setting an empty security context (length=0) on a file will lead to incorrectly dereferencing the type and other fields of the security context structure, yielding a kernel BUG. As a zero-length security context is never valid, just reject all such security contexts whether coming from userspace via setxattr or coming from the filesystem upon a getxattr request by SELinux. Setting a security context value (empty or otherwise) unknown to SELinux in the first place is only possible for a root process (CAP_MAC_ADMIN), and, if running SELinux in enforcing mode, only if the corresponding SELinux mac_admin permission is also granted to the domain by policy. In Fedora policies, this is only allowed for specific domains such as livecd for setting down security contexts that are not defined in the build host policy. [On Android, this can only be set by root/CAP_MAC_ADMIN processes, and if running SELinux in enforcing mode, only if mac_admin permission is granted in policy. In Android 4.4, this would only be allowed for root/CAP_MAC_ADMIN processes that are also in unconfined domains. In current AOSP master, mac_admin is not allowed for any domains except the recovery console which has a legitimate need for it. The other potential vector is mounting a maliciously crafted filesystem for which SELinux fetches xattrs (e.g. an ext4 filesystem on a SDcard). However, the end result is only a local denial-of-service (DOS) due to kernel BUG. This fix is queued for 3.14.] Reproducer: su setenforce 0 touch foo setfattr -n security.selinux foo Caveat: Relabeling or removing foo after doing the above may not be possible without booting with SELinux disabled. Any subsequent access to foo after doing the above will also trigger the BUG. BUG output from Matthew Thode: [ 473.893141] ------------[ cut here ]------------ [ 473.962110] kernel BUG at security/selinux/ss/services.c:654! 
[ 473.995314] invalid opcode: 0000 [#6] SMP [ 474.027196] Modules linked in: [ 474.058118] CPU: 0 PID: 8138 Comm: ls Tainted: G D I 3.13.0-grsec #1 [ 474.116637] Hardware name: Supermicro X8ST3/X8ST3, BIOS 2.0 07/29/10 [ 474.149768] task: ffff8805f50cd010 ti: ffff8805f50cd488 task.ti: ffff8805f50cd488 [ 474.183707] RIP: 0010:[<ffffffff814681c7>] [<ffffffff814681c7>] context_struct_compute_av+0xce/0x308 [ 474.219954] RSP: 0018:ffff8805c0ac3c38 EFLAGS: 00010246 [ 474.252253] RAX: 0000000000000000 RBX: ffff8805c0ac3d94 RCX: 0000000000000100 [ 474.287018] RDX: ffff8805e8aac000 RSI: 00000000ffffffff RDI: ffff8805e8aaa000 [ 474.321199] RBP: ffff8805c0ac3cb8 R08: 0000000000000010 R09: 0000000000000006 [ 474.357446] R10: 0000000000000000 R11: ffff8805c567a000 R12: 0000000000000006 [ 474.419191] R13: ffff8805c2b74e88 R14: 00000000000001da R15: 0000000000000000 [ 474.453816] FS: 00007f2e75220800(0000) GS:ffff88061fc00000(0000) knlGS:0000000000000000 [ 474.489254] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 474.522215] CR2: 00007f2e74716090 CR3: 00000005c085e000 CR4: 00000000000207f0 [ 474.556058] Stack: [ 474.584325] ffff8805c0ac3c98 ffffffff811b549b ffff8805c0ac3c98 ffff8805f1190a40 [ 474.618913] ffff8805a6202f08 ffff8805c2b74e88 00068800d0464990 ffff8805e8aac860 [ 474.653955] ffff8805c0ac3cb8 000700068113833a ffff880606c75060 ffff8805c0ac3d94 [ 474.690461] Call Trace: [ 474.723779] [<ffffffff811b549b>] ? 
lookup_fast+0x1cd/0x22a [ 474.778049] [<ffffffff81468824>] security_compute_av+0xf4/0x20b [ 474.811398] [<ffffffff8196f419>] avc_compute_av+0x2a/0x179 [ 474.843813] [<ffffffff8145727b>] avc_has_perm+0x45/0xf4 [ 474.875694] [<ffffffff81457d0e>] inode_has_perm+0x2a/0x31 [ 474.907370] [<ffffffff81457e76>] selinux_inode_getattr+0x3c/0x3e [ 474.938726] [<ffffffff81455cf6>] security_inode_getattr+0x1b/0x22 [ 474.970036] [<ffffffff811b057d>] vfs_getattr+0x19/0x2d [ 475.000618] [<ffffffff811b05e5>] vfs_fstatat+0x54/0x91 [ 475.030402] [<ffffffff811b063b>] vfs_lstat+0x19/0x1b [ 475.061097] [<ffffffff811b077e>] SyS_newlstat+0x15/0x30 [ 475.094595] [<ffffffff8113c5c1>] ? __audit_syscall_entry+0xa1/0xc3 [ 475.148405] [<ffffffff8197791e>] system_call_fastpath+0x16/0x1b [ 475.179201] Code: 00 48 85 c0 48 89 45 b8 75 02 0f 0b 48 8b 45 a0 48 8b 3d 45 d0 b6 00 8b 40 08 89 c6 ff ce e8 d1 b0 06 00 48 85 c0 49 89 c7 75 02 <0f> 0b 48 8b 45 b8 4c 8b 28 eb 1e 49 8d 7d 08 be 80 01 00 00 e8 [ 475.255884] RIP [<ffffffff814681c7>] context_struct_compute_av+0xce/0x308 [ 475.296120] RSP <ffff8805c0ac3c38> [ 475.328734] ---[ end trace f076482e9d754adc ]--- [sds: commit message edited to note Android implications and to generate a unique Change-Id for gerrit] Change-Id: I4d5389f0cfa72b5f59dada45081fa47e03805413 Reported-by: Matthew Thode <mthode@mthode.org> Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Cc: stable@vger.kernel.org Signed-off-by: Paul Moore <pmoore@redhat.com>
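The rule described above, rejecting every zero-length security context before any field is parsed, amounts to an early length check. The following is a minimal sketch of that idea only: `validate_context_len_demo` is a hypothetical helper, and returning `-EINVAL` is an assumption of this sketch rather than something the message states.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Reject contexts that cannot possibly be valid before any parsing
 * happens, so later code never sees an empty context structure.
 * Returns 0 when the length is acceptable, a negative errno otherwise.
 */
int validate_context_len_demo(const char *ctx, size_t len)
{
    if (ctx == NULL || len == 0)
        return -EINVAL;  /* assumed error code for this sketch */
    return 0;
}
```

Placing the check at the single entry point that converts a context string means both paths named above (setxattr from userspace and xattrs fetched from a filesystem) are covered by the same test.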

It appears possible for the dma to get nuked while the irq handler is pending/about to run. The dma is cancelled by nuke() but the irq handler will still run and wind up dereferencing NULL, so add a check. Crash log: [ 7629.098815] Unable to handle kernel NULL pointer dereference at virtual address 00000004 [ 7629.099273] pgd = c0004000 [ 7629.099456] [00000004] *pgd=00000000 [ 7629.099761] Internal error: Oops: 17 [#1] PREEMPT SMP [ 7629.100067] Modules linked in: [ 7629.100402] CPU: 0 Not tainted (3.0.8-gcb25bf2 #1) [ 7629.100585] PC is at txstate+0x74/0x1ec [ 7629.100891] LR is at musb_g_tx+0x138/0x1d8 [ 7629.101074] pc : [<c033f1cc>] lr : [<c033fc9c>] psr: 20000193 [ 7629.101074] sp : c06efd60 ip : c06efd98 fp : c06efd94 [ 7629.101531] r10: 00000018 r9 : 00000000 r8 : 00000200 [ 7629.101684] r7 : 00000000 r6 : fc0ab110 r5 : c7852448 r4 : c7125c00 [ 7629.101867] r3 : 00000000 r2 : 00000000 r1 : 00000002 r0 : c78520e4 ... [ 7629.228668] [<c033f158>] (txstate+0x0/0x1ec) from [<c033fc9c>] (musb_g_tx+0x138/0x1d8) [ 7629.229003] [<c033fb64>] (musb_g_tx+0x0/0x1d8) from [<c033d35c>] (musb_interrupt+0x104/0x8cc) [ 7629.229187] [<c033d258>] (musb_interrupt+0x0/0x8cc) from [<c033db7c>] (generic_interrupt+0x58/0x8c) [ 7629.229522] [<c033db24>] (generic_interrupt+0x0/0x8c) from [<c00da008>] (handle_irq_event_percpu+0x54/0x188) [ 7629.229827] r6:00000000 r5:c06f3f4c r4:c78f49c0 r3:c033db24 [Ruslan] Cherry-picked from k3.0 commit-id 0cf6404 Change-Id: I305c80a246f5161f2d3d20839c5603376d5b746c Signed-off-by: Mike J. Chen <mjchen@google.com> Signed-off-by: Ruslan Bilovol <ruslan.bilovol@ti.com>

MVFR0 and MVFR1 are only available starting with ARM1136 r1p0 release according to "B.5 VFP changes" in DDI0211F_arm1136_r1p0_trm.pdf. This is also when TLS register got added, so we can use HAS_TLS also to test for MVFR0 and MVFR1. Otherwise VFPFMRX and VFPFMXR access fails and we get: Internal error: Oops - undefined instruction: 0 [#1] PC is at no_old_VFP_process+0x8/0x3c LR is at __und_svc+0x48/0x80 ... Signed-off-by: Tony Lindgren <tony@atomide.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

cpufreq_register_driver sets cpufreq_driver to a structure owned (and placed) in the caller's memory. If cpufreq policy fails in its ->init function, sysdev_driver_register returns nonzero in cpufreq_register_driver. Now, cpufreq_register_driver returns an error without setting cpufreq_driver back to NULL. Usually cpufreq policy modules are unloaded because they propagate the error to the module init function and return that. So a later access to any member of cpufreq_driver causes bugs like: BUG: unable to handle kernel paging request at ffffffffa00270a0 IP: [<ffffffff8145eca3>] cpufreq_cpu_get+0x53/0xe0 PGD 1805067 PUD 1809063 PMD 1c3f90067 PTE 0 Oops: 0000 [#1] SMP last sysfs file: /sys/devices/virtual/net/tun0/statistics/collisions CPU 0 Modules linked in: ... Pid: 5677, comm: thunderbird-bin Tainted: G W 2.6.38-rc4-mm1_64+ #1389 To be filled by O.E.M./To Be Filled By O.E.M. RIP: 0010:[<ffffffff8145eca3>] [<ffffffff8145eca3>] cpufreq_cpu_get+0x53/0xe0 RSP: 0018:ffff8801aec37d98 EFLAGS: 00010086 RAX: 0000000000000202 RBX: 0000000000000000 RCX: 0000000000000001 RDX: ffffffffa00270a0 RSI: 0000000000001000 RDI: ffffffff8199ece8 ... Call Trace: [<ffffffff8145f490>] cpufreq_quick_get+0x10/0x30 [<ffffffff8103f12b>] show_cpuinfo+0x2ab/0x300 [<ffffffff81136292>] seq_read+0xf2/0x3f0 [<ffffffff8126c5d3>] ? __strncpy_from_user+0x33/0x60 [<ffffffff8116850d>] proc_reg_read+0x6d/0xa0 [<ffffffff81116e53>] vfs_read+0xc3/0x180 [<ffffffff81116f5c>] sys_read+0x4c/0x90 [<ffffffff81030dbb>] system_call_fastpath+0x16/0x1b ... It's all caused by weird failure-path handling in cpufreq_register_driver. To fix that, shuffle the code to do proper handling with gotos. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Dave Jones <davej@redhat.com>
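The failure-path pattern described above (publish a global pointer, then make sure it is cleared again if a later step fails) can be sketched in user-space C. This is an illustration only: `driver_demo`, `registered_driver_demo`, `register_driver_demo` and the two init helpers are hypothetical names, not the cpufreq API.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical driver descriptor owned by the caller (e.g. a module). */
struct driver_demo {
    int (*init)(void);
};

/* Global pointer that other code dereferences once a driver is registered. */
const struct driver_demo *registered_driver_demo = NULL;

int register_driver_demo(const struct driver_demo *drv)
{
    int ret;

    if (drv == NULL || drv->init == NULL)
        return -EINVAL;
    if (registered_driver_demo != NULL)
        return -EBUSY;

    registered_driver_demo = drv;

    ret = drv->init();
    if (ret != 0)
        goto err_clear;

    return 0;

err_clear:
    /* Without this reset the global would keep pointing into memory that
     * goes away when the failing caller is unloaded. */
    registered_driver_demo = NULL;
    return ret;
}

/* Two trivial init callbacks used to exercise both paths. */
int init_ok_demo(void)
{
    return 0;
}

int init_fail_demo(void)
{
    return -ENODEV;
}
```

Funnelling every failure through one label keeps the undo step in a single place, so adding another fallible step later cannot skip the reset.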

…entials It's possible for get_task_cred() as it currently stands to 'corrupt' a set of credentials by incrementing their usage count after their replacement by the task being accessed. What happens is that get_task_cred() can race with commit_creds(): TASK_1 TASK_2 RCU_CLEANER -->get_task_cred(TASK_2) rcu_read_lock() __cred = __task_cred(TASK_2) -->commit_creds() old_cred = TASK_2->real_cred TASK_2->real_cred = ... put_cred(old_cred) call_rcu(old_cred) [__cred->usage == 0] get_cred(__cred) [__cred->usage == 1] rcu_read_unlock() -->put_cred_rcu() [__cred->usage == 1] panic() However, since a task's credentials are generally not changed very often, we can reasonably make use of a loop involving reading the creds pointer and using atomic_inc_not_zero() to attempt to increment it if it hasn't already hit zero. If successful, we can safely return the credentials in the knowledge that, even if the task we're accessing has released them, they haven't gone to the RCU cleanup code. We then change task_state() in procfs to use get_task_cred() rather than calling get_cred() on the result of __task_cred(), as that suffers from the same problem. Without this change, a BUG_ON in __put_cred() or in put_cred_rcu() can be tripped when it is noticed that the usage count is not zero as it ought to be, for example: kernel BUG at kernel/cred.c:168! 
invalid opcode: 0000 [#1] SMP last sysfs file: /sys/kernel/mm/ksm/run CPU 0 Pid: 2436, comm: master Not tainted 2.6.33.3-85.fc13.x86_64 #1 0HR330/OptiPlex 745 RIP: 0010:[<ffffffff81069881>] [<ffffffff81069881>] __put_cred+0xc/0x45 RSP: 0018:ffff88019e7e9eb8 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffff880161514480 RCX: 00000000ffffffff RDX: 00000000ffffffff RSI: ffff880140c690c0 RDI: ffff880140c690c0 RBP: ffff88019e7e9eb8 R08: 00000000000000d0 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000040 R12: ffff880140c690c0 R13: ffff88019e77aea0 R14: 00007fff336b0a5c R15: 0000000000000001 FS: 00007f12f50d97c0(0000) GS:ffff880007400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8f461bc000 CR3: 00000001b26ce000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process master (pid: 2436, threadinfo ffff88019e7e8000, task ffff88019e77aea0) Stack: ffff88019e7e9ec8 ffffffff810698cd ffff88019e7e9ef8 ffffffff81069b45 <0> ffff880161514180 ffff880161514480 ffff880161514180 0000000000000000 <0> ffff88019e7e9f28 ffffffff8106aace 0000000000000001 0000000000000246 Call Trace: [<ffffffff810698cd>] put_cred+0x13/0x15 [<ffffffff81069b45>] commit_creds+0x16b/0x175 [<ffffffff8106aace>] set_current_groups+0x47/0x4e [<ffffffff8106ac89>] sys_setgroups+0xf6/0x105 [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b Code: 48 8d 71 ff e8 7e 4e 15 00 85 c0 78 0b 8b 75 ec 48 89 df e8 ef 4a 15 00 48 83 c4 18 5b c9 c3 55 8b 07 8b 07 48 89 e5 85 c0 74 04 <0f> 0b eb fe 65 48 8b 04 25 00 cc 00 00 48 3b b8 58 04 00 00 75 RIP [<ffffffff81069881>] __put_cred+0xc/0x45 RSP <ffff88019e7e9eb8> ---[ end trace df391256a100ebdd ]--- Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
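The "increment unless already zero" loop described above can be sketched with C11 atomics in user space. This is a sketch under stated assumptions, not the kernel implementation: `cred_demo`, `task_demo`, `inc_not_zero_demo` and `get_task_cred_demo` are hypothetical names, and the RCU read-side protection that the kernel relies on to keep the object alive while it is inspected is not modelled here.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object and its owner. */
struct cred_demo {
    atomic_int usage;
};

struct task_demo {
    struct cred_demo *_Atomic cred;
};

/*
 * Take a reference only if the count has not already dropped to zero.
 * A count of zero means the object is on its way to being freed and
 * must not be revived.
 */
bool inc_not_zero_demo(atomic_int *v)
{
    int old = atomic_load(v);

    while (old != 0) {
        /* On failure, old is refreshed with the current value. */
        if (atomic_compare_exchange_weak(v, &old, old + 1))
            return true;
    }
    return false;
}

/*
 * Re-read the task's current credentials until a reference can be taken.
 * If the pointer was swapped and the old object hit zero, the next pass
 * picks up the replacement. (In the kernel this runs under RCU.)
 */
struct cred_demo *get_task_cred_demo(struct task_demo *task)
{
    struct cred_demo *cred;

    do {
        cred = atomic_load(&task->cred);
    } while (!inc_not_zero_demo(&cred->usage));

    return cred;
}
```

The key difference from a plain increment is that a reader that lost the race sees the failed attempt and retries with the new pointer, instead of bumping a dead object's count from 0 to 1.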

commit ef962df upstream. Inlined xattr shares the free space of the inode block with inlined data or the data extent record, so the size of the latter two should be adjusted when inlined xattr is enabled. See ocfs2_xattr_ibody_init(). But this isn't done properly during reflink. For an inode with inlined data, its max inlined data size is adjusted in ocfs2_duplicate_inline_data(), no problem. But for an inode with a data extent record, its record count isn't adjusted. Fix it, or the data extent record and inlined xattr may overwrite each other, then cause data corruption or xattr failure. One panic caused by this bug in our test environment is the following: kernel BUG at fs/ocfs2/xattr.c:1435! invalid opcode: 0000 [#1] SMP Pid: 10871, comm: multi_reflink_t Not tainted 2.6.39-300.17.1.el5uek #1 RIP: ocfs2_xa_offset_pointer+0x17/0x20 [ocfs2] RSP: e02b:ffff88007a587948 EFLAGS: 00010283 RAX: 0000000000000000 RBX: 0000000000000010 RCX: 00000000000051e4 RDX: ffff880057092060 RSI: 0000000000000f80 RDI: ffff88007a587a68 RBP: ffff88007a587948 R08: 00000000000062f4 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000010 R13: ffff88007a587a68 R14: 0000000000000001 R15: ffff88007a587c68 FS: 00007fccff7f06e0(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000 CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000000015cf000 CR3: 000000007aa76000 CR4: 0000000000000660 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process multi_reflink_t Call Trace: ocfs2_xa_reuse_entry+0x60/0x280 [ocfs2] ocfs2_xa_prepare_entry+0x17e/0x2a0 [ocfs2] ocfs2_xa_set+0xcc/0x250 [ocfs2] ocfs2_xattr_ibody_set+0x98/0x230 [ocfs2] __ocfs2_xattr_set_handle+0x4f/0x700 [ocfs2] ocfs2_xattr_set+0x6c6/0x890 [ocfs2] ocfs2_xattr_user_set+0x46/0x50 [ocfs2] generic_setxattr+0x70/0x90 __vfs_setxattr_noperm+0x80/0x1a0 vfs_setxattr+0xa9/0xb0 setxattr+0xc3/0x120 sys_fsetxattr+0xa8/0xd0 system_call_fastpath+0x16/0x1b 
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Acked-by: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

A race condition is possible in the hci_tty driver. The race results in a NULL pointer dereference because struct sk_buff_head rx_list is used without prior initialization. The error condition can easily be reproduced with the following script and a COM-7 wilink hardware module: while [ 1 ]; do echo -n "fail" > /dev/nfc; sleep 2; done [ 56.229614] Unable to handle kernel NULL pointer dereference at virtual address 00000000 [ 56.238494] pgd = c0004000 [ 56.241485] [00000000] *pgd=00000000 [ 56.245513] Internal error: Oops: 805 [#1] PREEMPT SMP ARM [ 56.251586] Modules linked in: rproc_drm(O) tf_driver(O) gps_drv wl18xx(O) wl12xx(O) wlcore(O) mac80211(O) pvrsrvkm_sgx544_112(O) cfg80211(O) compat(O) [last unloaded: wlcore_sdio] [ 56.270141] CPU: 0 Tainted: G W O (3.4.34-01546-g66a9034 #82) [ 56.277618] PC is at skb_queue_tail+0x2c/0x50 [ 56.282409] LR is at _raw_spin_lock_irqsave+0x10/0x14 [ 56.287994] pc : [<c04eeb04>] lr : [<c0686618>] psr: 60000193 [ 56.287994] sp : d6cbde60 ip : d6cbde50 fp : d6cbde7c [ 56.300628] r10: d6ecf5d4 r9 : d6ecf558 r8 : 00000000 [ 56.306335] r7 : d6ecf5ac r6 : cb78347c r5 : d67c6500 r4 : cb783470 [ 56.313537] r3 : 00000000 r2 : a0000193 r1 : d67c6500 r0 : a0000193 [ 56.320648] Flags: nZCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment kernel [ 56.328796] Control: 10c5387d Table: 8b6a804a DAC: 00000015 [skipped ...] 
[ 57.452667] Backtrace: [ 57.455474] [<c04eead8>] (skb_queue_tail+0x0/0x50) from [<c0300bec>] (st_receive+0x18/0x34) [ 57.464630] r6:d5920c59 r5:00000004 r4:cb783440 r3:c0300bd4 [ 57.471191] [<c0300bd4>] (st_receive+0x0/0x34) from [<c02fe924>] (st_send_frame+0x50/0xac) [ 57.480255] r4:d6ecf540 r3:c0300bd4 [ 57.484344] [<c02fe8d4>] (st_send_frame+0x0/0xac) from [<c02ff13c>] (st_int_recv+0x1fc/0x3a0) [ 57.493713] r5:00000000 r4:d6ecf540 [ 57.497894] [<c02fef40>] (st_int_recv+0x0/0x3a0) from [<c02fe548>] (st_tty_receive+0x24/0x28) [ 57.507171] [<c02fe524>] (st_tty_receive+0x0/0x28) from [<c02bbac0>] (flush_to_ldisc+0x150/0x1b4) [ 57.516906] [<c02bb970>] (flush_to_ldisc+0x0/0x1b4) from [<c0061950>] (process_one_work+0x134/0x4ac) [ 57.526916] [<c006181c>] (process_one_work+0x0/0x4ac) from [<c0061e54>] (worker_thread+0x18c/0x3d8) [ 57.536865] [<c0061cc8>] (worker_thread+0x0/0x3d8) from [<c00668d0>] (kthread+0x90/0x9c) [ 57.545684] [<c0066840>] (kthread+0x0/0x9c) from [<c004a8a8>] (do_exit+0x0/0x804) [ 57.553894] r6:c004a8a8 r5:c0066840 r4:d6c5dec4 Change-Id: Ife34d53b4fad45d1db600d71450b06dce0328b2c Signed-off-by: Oleksandr Kozaruk <oleksandr.kozaruk@ti.com>

In case of error condition registered protocols were not unregistered from st core when open syscall was called. Memory allocated for the driver data was just freed. Then callback from st_register() called st_reg_complete_cb, and complete() had the argument from the already freed memory. This could be the reason of the null pointer dereference, as in the log below. The issue is rarely reproduced. [ 132.189086] (stc): st_register(12) [ 132.198394] (stc): chnl_id list empty :12 [ 132.205352] (stk) : st_kim_start [ 132.329040] (stk) :ldisc_install = 1 [ 133.087829] mtp_open [ 133.329010] (stk) :ldisc installation timeout [ 133.334960] (stk) :ldisc_install = 0 [ 134.336853] (stk) : timed out waiting for ldisc to be un-installed [ 134.463165] (stk) :ldisc_install = 1 [ 135.469757] (stk) :ldisc installation timeout [ 135.474334] (stk) :ldisc_install = 0 [ 135.557830] init: sys_prop: permission denied uid:1003 name:service.bootanim.exit [ 135.653076] init: Boot Animation exit [ 135.895965] (hci_tty): inside hci_tty_open (d66bce38, d66ff480) [ 135.902435] (stc): st_register(4) [ 135.906494] (stc): ST_REG_IN_PROGRESS:4 [ 135.910858] (stc): add_channel_to_table: id 4 [ 136.477478] (stk) : timed out waiting for ldisc to be un-installed [ 136.594635] (stk) :ldisc_install = 1 [ 137.595001] (stk) :ldisc installation timeout [ 137.603759] (stk) :ldisc_install = 0 [ 138.602508] (stk) : timed out waiting for ldisc to be un-installed [ 138.727478] (stk) :ldisc_install = 1 [ 139.727722] (stk) :ldisc installation timeout [ 139.733734] (stk) :ldisc_install = 0 [ 140.580657] binder: release 1324:1324 transaction 19588 out, still active [ 140.735321] (stk) : timed out waiting for ldisc to be un-installed [ 140.852447] (stk) :ldisc_install = 1 [ 141.852478] (stk) :ldisc installation timeout [ 141.857360] (stk) :ldisc_install = 0 [ 141.914978] (hci_tty): Timeout(6 sec),didn't get reg completion signal from ST [ 142.868072] (stk) : timed out waiting for ldisc to be un-installed [ 
142.985382] (stk) :ldisc_install = 1 [ 143.985351] (stk) :ldisc installation timeout [ 143.991546] (stk) :ldisc_install = 0 [ 144.993072] (stk) : timed out waiting for ldisc to be un-installed [ 145.002960] (stc): KIM failure complete callback [ 145.008392] (stc): st_reg_complete [ 145.012725] (hci_tty): @ st_reg_completion_cb [ 145.017639] Unable to handle kernel NULL pointer dereference at virtual address 00000010 [ 145.026428] pgd = c7cc0000 [ 145.029388] [00000010] *pgd=00000000 [ 145.033386] Internal error: Oops: 5 [#1] PREEMPT SMP ARM [ 145.039215] Modules linked in: rproc_drm(O) tf_driver(O) gps_drv wl18xx(O) wl12xx(O) wlcore(O) mac80211(O) cfg80211(O) pvrsrvkm_sgx540_120(O) compat(O) [ 145.054870] CPU: 1 Tainted: G W O (3.4.34 #1) [ 145.060821] PC is at __wake_up_common+0x2c/0x94 [ 145.065734] LR is at complete+0x4c/0x60 [ 145.069915] pc : [<c006ee4c>] lr : [<c0070378>] psr: a0000093 [skiped...] [ 146.023742] Backtrace: [ 146.026550] [<c006ee20>] (__wake_up_common+0x0/0x94) from [<c0070378>] (complete+0x4c/0x60) [ 146.035644] [<c007032c>] (complete+0x0/0x60) from [<c0301f20>] (st_reg_completion_cb+0x30/0x38) [ 146.045104] r6:d6ef5cd0 r5:00000092 r4:d617dd40 [ 146.050384] [<c0301ef0>] (st_reg_completion_cb+0x0/0x38) from [<c02ffd34>] (st_reg_complete+0x60/0xa8) [ 146.060516] r5:d6ef5cc4 r4:00000004 [ 146.064575] [<c02ffcd4>] (st_reg_complete+0x0/0xa8) from [<c02fffac>] (st_register+0x230/0x324) [ 146.074066] [<c02ffd7c>] (st_register+0x0/0x324) from [<c0323cac>] (nfc_drv_open+0xe8/0x1e4) [ 146.083251] r7:c8ec3840 r6:c0a5f89c r5:00000000 r4:c6016140 [ 146.089752] [<c0323bc4>] (nfc_drv_open+0x0/0x1e4) from [<c0117aec>] (chrdev_open+0x9c/0x164) [ 146.098937] [<c0117a50>] (chrdev_open+0x0/0x164) from [<c0111b88>] (__dentry_open+0x200/0x2b8) [ 146.108306] r8:c0117a50 r7:d66b1b28 r6:d5e1e910 r5:d69806f0 r4:c8ec3840 [ 146.115997] [<c0111988>] (__dentry_open+0x0/0x2b8) from [<c0112c24>] (nameidata_to_filp+0x68/0x70) [ 146.125701] [<c0112bbc>] 
(nameidata_to_filp+0x0/0x70) from [<c01211ac>] (do_last.isra.20+0x150/0x6d4) [ 146.135711] r7:00000026 r6:00000000 r5:00020002 r4:c6ccfed8 [ 146.142272] [<c012105c>] (do_last.isra.20+0x0/0x6d4) from [<c0121954>] (path_openat+0xc0/0x3b8) [ 146.151672] [<c0121894>] (path_openat+0x0/0x3b8) from [<c0121d5c>] (do_filp_open+0x34/0x88) [ 146.160766] [<c0121d28>] (do_filp_open+0x0/0x88) from [<c0112d20>] (do_sys_open+0xf4/0x18c) [ 146.169830] r7:00000001 r6:00000027 r5:00020002 r4:d63d1000 [ 146.176330] [<c0112c2c>] (do_sys_open+0x0/0x18c) from [<c0112de0>] (sys_open+0x28/0x2c) [ 146.185058] [<c0112db8>] (sys_open+0x0/0x2c) from [<c0013680>] (ret_fast_syscall+0x0/0x30) [ 146.194061] Code: e1a08003 e50b2030 e157000c e59b9004 (e41c400c) [ Change-Id: I10085ef1b1bc91ce3be01e179aa995287af271f1 Signed-off-by: Oleksandr Kozaruk <oleksandr.kozaruk@ti.com>

This patch series provides the ability for cgroup subsystems to be compiled as modules both within and outside the kernel tree. This is mainly useful for classifiers and subsystems that hook into components that are already modules. cls_cgroup and blkio-cgroup serve as the example use cases for this feature. It provides an interface cgroup_load_subsys() and cgroup_unload_subsys() which modular subsystems can use to register and depart during runtime. The net_cls classifier subsystem serves as the example for a subsystem which can be converted into a module using these changes. Patch #1 sets up the subsys[] array so its contents can be dynamic as modules appear and (eventually) disappear. Iterations over the array are modified to handle when subsystems are absent, and the dynamic section of the array is protected by cgroup_mutex. Patch #2 implements an interface for modules to load subsystems, called cgroup_load_subsys, similar to cgroup_init_subsys, and adds a module pointer in struct cgroup_subsys. Patch #3 adds a mechanism for unloading modular subsystems, which includes a more advanced rework of the rudimentary reference counting introduced in patch 2. Patch #4 modifies the net_cls subsystem, which already had some module declarations, to be configurable as a module, which also serves as a simple proof-of-concept. Part of implementing patches 2 and 4 involved updating css pointers in each css_set when the module appears or leaves. In doing this, it was discovered that css_sets always remain linked to the dummy cgroup, regardless of whether or not any subsystems are actually bound to it (i.e., not mounted on an actual hierarchy). 
The subsystem loading and unloading code therefore should keep in mind the special cases where the added subsystem is the only one in the dummy cgroup (and therefore all css_sets need to be linked back into it) and where the removed subsys was the only one in the dummy cgroup (and therefore all css_sets should be unlinked from it) - however, as all css_sets always stay attached to the dummy cgroup anyway, these cases are ignored. Any fix that addresses this issue should also make sure these cases are addressed in the subsystem loading and unloading code. This patch: Make subsys[] able to be dynamically populated to support modular subsystems This patch reworks the way the subsys[] array is used so that subsystems can register themselves after boot time, and enables the internals of cgroups to be able to handle when subsystems are not present or may appear/disappear. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
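The idea of a subsystem table whose slots may be empty, with every iteration skipping absent entries, can be sketched as follows. This is a simplified illustration, not the cgroup code: `subsys_demo`, `subsys_table_demo` and the three helpers are hypothetical names, the table size is arbitrary, and the locking that the series places around the dynamic part of the array (cgroup_mutex) is deliberately left out.

```c
#include <errno.h>
#include <stddef.h>

#define SUBSYS_SLOTS_DEMO 4  /* arbitrary size for this sketch */

struct subsys_demo {
    const char *name;
};

/* A NULL slot means "no subsystem currently registered here". */
struct subsys_demo *subsys_table_demo[SUBSYS_SLOTS_DEMO];

/* Register in the first free slot; returns the slot index or -EBUSY. */
int load_subsys_demo(struct subsys_demo *ss)
{
    for (size_t i = 0; i < SUBSYS_SLOTS_DEMO; i++) {
        if (subsys_table_demo[i] == NULL) {
            subsys_table_demo[i] = ss;
            return (int)i;
        }
    }
    return -EBUSY;
}

/* Depart at runtime by emptying the slot again. */
void unload_subsys_demo(int slot)
{
    if (slot >= 0 && slot < SUBSYS_SLOTS_DEMO)
        subsys_table_demo[slot] = NULL;
}

/* Every walk over the table has to tolerate absent subsystems. */
size_t count_subsys_demo(void)
{
    size_t n = 0;

    for (size_t i = 0; i < SUBSYS_SLOTS_DEMO; i++) {
        if (subsys_table_demo[i] == NULL)
            continue;
        n++;
    }
    return n;
}
```

A real implementation would take a lock around both the walk and the slot updates, since a module may appear or leave while another CPU is iterating.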

commit 62b61f6 ("ksm: memory hotremove migration only") caused the following new lockdep warning. ======================================================= [ INFO: possible circular locking dependency detected ] ------------------------------------------------------- bash/1621 is trying to acquire lock: ((memory_chain).rwsem){.+.+.+}, at: [<ffffffff81079339>] __blocking_notifier_call_chain+0x69/0xc0 but task is already holding lock: (ksm_thread_mutex){+.+.+.}, at: [<ffffffff8113a3aa>] ksm_memory_callback+0x3a/0xc0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (ksm_thread_mutex){+.+.+.}: [<ffffffff8108b70a>] lock_acquire+0xaa/0x140 [<ffffffff81505d74>] __mutex_lock_common+0x44/0x3f0 [<ffffffff81506228>] mutex_lock_nested+0x48/0x60 [<ffffffff8113a3aa>] ksm_memory_callback+0x3a/0xc0 [<ffffffff8150c21c>] notifier_call_chain+0x8c/0xe0 [<ffffffff8107934e>] __blocking_notifier_call_chain+0x7e/0xc0 [<ffffffff810793a6>] blocking_notifier_call_chain+0x16/0x20 [<ffffffff813afbfb>] memory_notify+0x1b/0x20 [<ffffffff81141b7c>] remove_memory+0x1cc/0x5f0 [<ffffffff813af53d>] memory_block_change_state+0xfd/0x1a0 [<ffffffff813afd62>] store_mem_state+0xe2/0xf0 [<ffffffff813a0bb0>] sysdev_store+0x20/0x30 [<ffffffff811bc116>] sysfs_write_file+0xe6/0x170 [<ffffffff8114f398>] vfs_write+0xc8/0x190 [<ffffffff8114fc14>] sys_write+0x54/0x90 [<ffffffff810028b2>] system_call_fastpath+0x16/0x1b -> #0 ((memory_chain).rwsem){.+.+.+}: [<ffffffff8108b5ba>] __lock_acquire+0x155a/0x1600 [<ffffffff8108b70a>] lock_acquire+0xaa/0x140 [<ffffffff81506601>] down_read+0x51/0xa0 [<ffffffff81079339>] __blocking_notifier_call_chain+0x69/0xc0 [<ffffffff810793a6>] blocking_notifier_call_chain+0x16/0x20 [<ffffffff813afbfb>] memory_notify+0x1b/0x20 [<ffffffff81141f1e>] remove_memory+0x56e/0x5f0 [<ffffffff813af53d>] memory_block_change_state+0xfd/0x1a0 [<ffffffff813afd62>] store_mem_state+0xe2/0xf0 [<ffffffff813a0bb0>] sysdev_store+0x20/0x30 
[<ffffffff811bc116>] sysfs_write_file+0xe6/0x170 [<ffffffff8114f398>] vfs_write+0xc8/0x190 [<ffffffff8114fc14>] sys_write+0x54/0x90 [<ffffffff810028b2>] system_call_fastpath+0x16/0x1b But it's a false positive. Both memory_chain.rwsem and ksm_thread_mutex have an outer lock (mem_hotplug_mutex). So they cannot deadlock. Thus, This patch annotate ksm_thread_mutex is not deadlock source. [akpm@linux-foundation.org: update comment, from Hugh] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: I7526262f8657178b135bcb3889d9f0c43d0f20a8

In 2.6.34-rc1, removing the vhost_net module causes an oops in sync_mm_rss (called from do_exit) when the workqueue is destroyed. This does not happen on net-next, or with vhost on top of 2.6.33.

The issue seems to be introduced by 34e5523 ("mm: avoid false sharing of mm_counter"), which added sync_mm_rss(). That function is passed task->mm and dereferences it without checking. If the task is a kernel thread, mm might be NULL. I think this might also happen e.g. with aio.

This patch fixes the oops by calling sync_mm_rss when task->mm is set to NULL. I also added BUG_ON to detect any other cases where counters get incremented while mm is NULL.

The oops I observed looks like this:

BUG: unable to handle kernel NULL pointer dereference at 00000000000002a8
IP: [<ffffffff810b436d>] sync_mm_rss+0x33/0x6f
PGD 0
Oops: 0002 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
CPU 2
Modules linked in: vhost_net(-) tun bridge stp sunrpc ipv6 cpufreq_ondemand acpi_cpufreq freq_table kvm_intel kvm i5000_edac edac_core rtc_cmos bnx2 button i2c_i801 i2c_core rtc_core e1000e sg joydev ide_cd_mod serio_raw pcspkr rtc_lib cdrom virtio_net virtio_blk virtio_pci virtio_ring virtio af_packet e1000 shpchp aacraid uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
Pid: 2046, comm: vhost Not tainted 2.6.34-rc1-vhost #25 System Planar/IBM System x3550 -[7978B3G]-
RIP: 0010:[<ffffffff810b436d>] [<ffffffff810b436d>] sync_mm_rss+0x33/0x6f
RSP: 0018:ffff8802379b7e60 EFLAGS: 00010202
RAX: 0000000000000008 RBX: ffff88023f2390c0 RCX: 0000000000000000
RDX: ffff88023f2396b0 RSI: 0000000000000000 RDI: ffff88023f2390c0
RBP: ffff8802379b7e60 R08: 0000000000000000 R09: 0000000000000000
R10: ffff88023aecfbc0 R11: 0000000000013240 R12: 0000000000000000
R13: ffffffff81051a6c R14: ffffe8ffffc0f540 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff880001e80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000002a8 CR3: 000000023af23000 CR4: 00000000000406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process vhost (pid: 2046, threadinfo ffff8802379b6000, task ffff88023f2390c0)
Stack:
 ffff8802379b7ee0 ffffffff81040687 ffffe8ffffc0f558 ffffffffa00a3e2d
<0> 0000000000000000 ffff88023f2390c0 ffffffff81055817 ffff8802379b7e98
<0> ffff8802379b7e98 0000000100000286 ffff8802379b7ee0 ffff88023ad47d78
Call Trace:
 [<ffffffff81040687>] do_exit+0x147/0x6c4
 [<ffffffffa00a3e2d>] ? handle_rx_net+0x0/0x17 [vhost_net]
 [<ffffffff81055817>] ? autoremove_wake_function+0x0/0x39
 [<ffffffff81051a6c>] ? worker_thread+0x0/0x229
 [<ffffffff810553c9>] kthreadd+0x0/0xf2
 [<ffffffff810038d4>] kernel_thread_helper+0x4/0x10
 [<ffffffff81055342>] ? kthread+0x0/0x87
 [<ffffffff810038d0>] ? kernel_thread_helper+0x0/0x10
Code: 00 8b 87 6c 02 00 00 85 c0 74 14 48 98 f0 48 01 86 a0 02 00 00 c7 87 6c 02 00 00 00 00 00 00 8b 87 70 02 00 00 85 c0 74 14 48 98 <f0> 48 01 86 a8 02 00 00 c7 87 70 02 00 00 00 00 00 00 8b 87 74
RIP [<ffffffff810b436d>] sync_mm_rss+0x33/0x6f
 RSP <ffff8802379b7e60>
CR2: 00000000000002a8
---[ end trace 41603ba922beddd2 ]---
Fixing recursive fault but reboot is needed!

(note: handle_rx_net is a work item using the workqueue in question). sync_mm_rss+0x33/0x6f gave me a hint. I also tried reverting 34e5523 and the oops goes away.

The module in question calls use_mm and later unuse_mm from a kernel thread. It is when this kernel thread is destroyed that the crash happens.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
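
The failure mode and the fix can be illustrated with a small userspace model. This is only a sketch, not the kernel code: every name ending in `_model` is a hypothetical stand-in, the `assert()` calls play the role of the BUG_ON mentioned in the commit message, and the NULL check on the exit path is an assumption about how the exit side stays safe once the counters are flushed when the mm is dropped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel structures. */
struct mm_model {
	long rss;            /* shared counter the per-task deltas fold into */
};

struct task_model {
	struct mm_model *mm; /* NULL for a kernel thread with no borrowed mm */
	long cached_rss;     /* per-task delta not yet folded into mm->rss */
};

/* Fold the task's cached delta into the mm. The caller must pass a valid mm. */
static void sync_rss_model(struct task_model *t, struct mm_model *mm)
{
	assert(mm != NULL);  /* stands in for BUG_ON(!mm) */
	mm->rss += t->cached_rss;
	t->cached_rss = 0;
}

/* Account pages to the task; only legal while an mm is attached. */
static void add_rss_model(struct task_model *t, long pages)
{
	assert(t->mm != NULL);
	t->cached_rss += pages;
}

/* Like use_mm(): a kernel thread temporarily adopts a user mm. */
static void use_mm_model(struct task_model *t, struct mm_model *mm)
{
	t->mm = mm;
}

/* Like unuse_mm(): flush the cached counters before the mm goes away. */
static void unuse_mm_model(struct task_model *t)
{
	sync_rss_model(t, t->mm);
	t->mm = NULL;
}

/* Exit path (assumed shape): sync only when there is an mm to sync into. */
static void exit_model(struct task_model *t)
{
	if (t->mm)
		sync_rss_model(t, t->mm);
}
```

In this model a kernel thread that calls `use_mm_model()`, accounts a few pages, and then calls `unuse_mm_model()` reaches `exit_model()` with no cached delta and no mm, so nothing is dereferenced at exit.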

An issue was observed when a userspace task exits. The page which hits the error here is the zero page. In binder mmap, the whole of the vma is not mapped. On a task crash, when debuggerd reads the binder regions, the unmapped areas fall through to do_anonymous_page in handle_pte_fault, due to the absence of a vm_fault handler. This results in the zero page being mapped. Later, in zap_pte_range, vm_normal_page returns the zero page in the VM_MIXEDMAP case, and this results in the error.

BUG: Bad page map in process mediaserver pte:9dff379f pmd:9bfbd831
page:c0ed8e60 count:1 mapcount:-1 mapping: (null) index:0x0
page flags: 0x404(referenced|reserved)
addr:40c3f000 vm_flags:10220051 anon_vma: (null) mapping:d9fe0764 index:fd
vma->vm_ops->fault: (null)
vma->vm_file->f_op->mmap: binder_mmap+0x0/0x274
CPU: 0 PID: 1463 Comm: mediaserver Tainted: G W 3.10.17+ #1
[<c001549c>] (unwind_backtrace+0x0/0x11c) from [<c001200c>] (show_stack+0x10/0x14)
[<c001200c>] (show_stack+0x10/0x14) from [<c0103d78>] (print_bad_pte+0x158/0x190)
[<c0103d78>] (print_bad_pte+0x158/0x190) from [<c01055f0>] (unmap_single_vma+0x2e4/0x598)
[<c01055f0>] (unmap_single_vma+0x2e4/0x598) from [<c010618c>] (unmap_vmas+0x34/0x50)
[<c010618c>] (unmap_vmas+0x34/0x50) from [<c010a9e4>] (exit_mmap+0xc8/0x1e8)
[<c010a9e4>] (exit_mmap+0xc8/0x1e8) from [<c00520f0>] (mmput+0x54/0xd0)
[<c00520f0>] (mmput+0x54/0xd0) from [<c005972c>] (do_exit+0x360/0x990)
[<c005972c>] (do_exit+0x360/0x990) from [<c0059ef0>] (do_group_exit+0x84/0xc0)
[<c0059ef0>] (do_group_exit+0x84/0xc0) from [<c0066de0>] (get_signal_to_deliver+0x4d4/0x548)
[<c0066de0>] (get_signal_to_deliver+0x4d4/0x548) from [<c0011500>] (do_signal+0xa8/0x3b8)

Add a vm_fault handler which returns VM_FAULT_SIGBUS, and prevents the wrong fallback to do_anonymous_page.

Signed-off-by: Vinayak Menon <vinayakm.list@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
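
The effect of installing a fault handler can be sketched in plain C. This is a userspace model under stated assumptions, not the binder driver: `vma_model`, `handle_fault_model()` and the `OUTCOME_*` values are hypothetical names, and the model only captures the dispatch decision (handler present or not), not real page-table work.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcomes of a page fault in this userspace model. */
enum fault_outcome {
	OUTCOME_ZERO_PAGE_MAPPED, /* anonymous fallback installed the zero page */
	OUTCOME_SIGBUS            /* the faulting task is sent SIGBUS instead */
};

struct vma_model {
	/* NULL means "no fault handler", as in binder before the fix. */
	enum fault_outcome (*fault)(void);
};

/* Models the added handler: refuse the fault rather than map anything. */
static enum fault_outcome binder_fault_model(void)
{
	return OUTCOME_SIGBUS;
}

/* Models the dispatch: without a handler, fall back to the anonymous path. */
static enum fault_outcome handle_fault_model(const struct vma_model *vma)
{
	assert(vma != NULL);
	if (vma->fault)
		return vma->fault();
	return OUTCOME_ZERO_PAGE_MAPPED;
}
```

With `fault` left NULL the model takes the anonymous fallback, which corresponds to the zero page that later trips the bad-page-map check; with `binder_fault_model` installed the access fails cleanly.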

find_task_by_vpid() says "Must be called under rcu_read_lock()". But due to commit 3120438 "rcu: Disable lockdep checking in RCU list-traversal primitives", we are currently unable to catch "find_task_by_vpid() with tasklist_lock held but RCU lock not held" errors, because the RCU-lockdep checks are suppressed in the RCU variants of the struct list_head traversals. This commit therefore places an explicit check for being in an RCU read-side critical section in find_task_by_pid_ns().

===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
kernel/pid.c:386 invoked rcu_dereference_check() without protection!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 1
1 lock held by rc.sysinit/1102:
 #0: (tasklist_lock){.+.+..}, at: [<c1048340>] sys_setpgid+0x40/0x160

stack backtrace:
Pid: 1102, comm: rc.sysinit Not tainted 2.6.35-rc3-dirty #1
Call Trace:
 [<c105e714>] lockdep_rcu_dereference+0x94/0xb0
 [<c104b4cd>] find_task_by_pid_ns+0x6d/0x70
 [<c104b4e8>] find_task_by_vpid+0x18/0x20
 [<c1048347>] sys_setpgid+0x47/0x160
 [<c1002b50>] sysenter_do_call+0x12/0x36

Commit updated to use a new rcu_lockdep_assert() exported API rather than the old internal __do_rcu_dereference().

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
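
The idea of an explicit check at the lookup entry point can be sketched in userspace C. The names below (`rcu_read_lock_model()`, `find_task_model()`, `lockdep_splats`) are hypothetical, and a plain nesting counter stands in for the kernel's notion of "inside an RCU read-side critical section"; the real patch uses rcu_lockdep_assert(), which is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

static int rcu_depth;      /* models the read-side nesting depth */
static int lockdep_splats; /* counts warnings a lockdep kernel would print */

static void rcu_read_lock_model(void)
{
	rcu_depth++;
}

static void rcu_read_unlock_model(void)
{
	assert(rcu_depth > 0);
	rcu_depth--;
}

struct task_entry {
	int pid;
};

static struct task_entry tasks[] = { { 1 }, { 1102 } };

/* Lookup with an explicit "am I in a read-side section?" check up front. */
static struct task_entry *find_task_model(int pid)
{
	if (rcu_depth == 0)
		lockdep_splats++; /* caller forgot rcu_read_lock_model() */

	for (size_t i = 0; i < sizeof(tasks) / sizeof(tasks[0]); i++) {
		if (tasks[i].pid == pid)
			return &tasks[i];
	}
	return NULL;
}
```

Because the check sits in the lookup itself rather than in the list traversal, a caller that holds some other lock but not the read-side lock is still flagged, which is the case the commit message describes for sys_setpgid().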

We have to delay vfs_dq_claim_space() until allocation context destruction. Currently we have the following call-trace:

ext4_mb_new_blocks()
  /* task is already holding ac->alloc_semp */
 ->ext4_mb_mark_diskspace_used
   ->vfs_dq_claim_space() /* acquire dqptr_sem here. Possible deadlock */
 ->ext4_mb_release_context() /* drop ac->alloc_semp here */

Let's move quota claiming to ext4_da_update_reserve_space()

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.32-rc7 #18
-------------------------------------------------------
write-truncate-/3465 is trying to acquire lock:
 (&s->s_dquot.dqptr_sem){++++..}, at: [<c025e73b>] dquot_claim_space+0x3b/0x1b0

but task is already holding lock:
 (&meta_group_info[i]->alloc_sem){++++..}, at: [<c02ce962>] ext4_mb_load_buddy+0xb2/0x370

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 (&meta_group_info[i]->alloc_sem){++++..}:
 [<c017d04b>] __lock_acquire+0xd7b/0x1260
 [<c017d5ea>] lock_acquire+0xba/0xd0
 [<c0527191>] down_read+0x51/0x90
 [<c02ce962>] ext4_mb_load_buddy+0xb2/0x370
 [<c02d0c1c>] ext4_mb_free_blocks+0x46c/0x870
 [<c029c9d3>] ext4_free_blocks+0x73/0x130
 [<c02c8cfc>] ext4_ext_truncate+0x76c/0x8d0
 [<c02a8087>] ext4_truncate+0x187/0x5e0
 [<c01e0f7b>] vmtruncate+0x6b/0x70
 [<c022ec02>] inode_setattr+0x62/0x190
 [<c02a2d7a>] ext4_setattr+0x25a/0x370
 [<c022ee81>] notify_change+0x151/0x340
 [<c021349d>] do_truncate+0x6d/0xa0
 [<c0221034>] may_open+0x1d4/0x200
 [<c022412b>] do_filp_open+0x1eb/0x910
 [<c021244d>] do_sys_open+0x6d/0x140
 [<c021258e>] sys_open+0x2e/0x40
 [<c0103100>] sysenter_do_call+0x12/0x32

-> #2 (&ei->i_data_sem){++++..}:
 [<c017d04b>] __lock_acquire+0xd7b/0x1260
 [<c017d5ea>] lock_acquire+0xba/0xd0
 [<c0527191>] down_read+0x51/0x90
 [<c02a5787>] ext4_get_blocks+0x47/0x450
 [<c02a74c1>] ext4_getblk+0x61/0x1d0
 [<c02a7a7f>] ext4_bread+0x1f/0xa0
 [<c02bcddc>] ext4_quota_write+0x12c/0x310
 [<c0262d23>] qtree_write_dquot+0x93/0x120
 [<c0261708>] v2_write_dquot+0x28/0x30
 [<c025d3fb>] dquot_commit+0xab/0xf0
 [<c02be977>] ext4_write_dquot+0x77/0x90
 [<c02be9bf>] ext4_mark_dquot_dirty+0x2f/0x50
 [<c025e321>] dquot_alloc_inode+0x101/0x180
 [<c029fec2>] ext4_new_inode+0x602/0xf00
 [<c02ad789>] ext4_create+0x89/0x150
 [<c0221ff2>] vfs_create+0xa2/0xc0
 [<c02246e7>] do_filp_open+0x7a7/0x910
 [<c021244d>] do_sys_open+0x6d/0x140
 [<c021258e>] sys_open+0x2e/0x40
 [<c0103100>] sysenter_do_call+0x12/0x32

-> #1 (&sb->s_type->i_mutex_key#7/4){+.+...}:
 [<c017d04b>] __lock_acquire+0xd7b/0x1260
 [<c017d5ea>] lock_acquire+0xba/0xd0
 [<c0526505>] mutex_lock_nested+0x65/0x2d0
 [<c0260c9d>] vfs_load_quota_inode+0x4bd/0x5a0
 [<c02610af>] vfs_quota_on_path+0x5f/0x70
 [<c02bc812>] ext4_quota_on+0x112/0x190
 [<c026345a>] sys_quotactl+0x44a/0x8a0
 [<c0103100>] sysenter_do_call+0x12/0x32

-> #0 (&s->s_dquot.dqptr_sem){++++..}:
 [<c017d361>] __lock_acquire+0x1091/0x1260
 [<c017d5ea>] lock_acquire+0xba/0xd0
 [<c0527191>] down_read+0x51/0x90
 [<c025e73b>] dquot_claim_space+0x3b/0x1b0
 [<c02cb95f>] ext4_mb_mark_diskspace_used+0x36f/0x380
 [<c02d210a>] ext4_mb_new_blocks+0x34a/0x530
 [<c02c83fb>] ext4_ext_get_blocks+0x122b/0x13c0
 [<c02a5966>] ext4_get_blocks+0x226/0x450
 [<c02a5ff3>] mpage_da_map_blocks+0xc3/0xaa0
 [<c02a6ed6>] ext4_da_writepages+0x506/0x790
 [<c01de272>] do_writepages+0x22/0x50
 [<c01d766d>] __filemap_fdatawrite_range+0x6d/0x80
 [<c01d7b9b>] filemap_flush+0x2b/0x30
 [<c02a40ac>] ext4_alloc_da_blocks+0x5c/0x60
 [<c029e595>] ext4_release_file+0x75/0xb0
 [<c0216b59>] __fput+0xf9/0x210
 [<c0216c97>] fput+0x27/0x30
 [<c02122dc>] filp_close+0x4c/0x80
 [<c014510e>] put_files_struct+0x6e/0xd0
 [<c01451b7>] exit_files+0x47/0x60
 [<c0146a24>] do_exit+0x144/0x710
 [<c0147028>] do_group_exit+0x38/0xa0
 [<c0159abc>] get_signal_to_deliver+0x2ac/0x410
 [<c0102849>] do_notify_resume+0xb9/0x890
 [<c01032d2>] work_notifysig+0x13/0x21

other info that might help us debug this:

3 locks held by write-truncate-/3465:
 #0: (jbd2_handle){+.+...}, at: [<c02e1f8f>] start_this_handle+0x38f/0x5c0
 #1: (&ei->i_data_sem){++++..}, at: [<c02a57f6>] ext4_get_blocks+0xb6/0x450
 #2: (&meta_group_info[i]->alloc_sem){++++..}, at: [<c02ce962>] ext4_mb_load_buddy+0xb2/0x370

stack backtrace:
Pid: 3465, comm: write-truncate- Not tainted 2.6.32-rc7 #18
Call Trace:
 [<c0524cb3>] ? printk+0x1d/0x22
 [<c017ac9a>] print_circular_bug+0xca/0xd0
 [<c017d361>] __lock_acquire+0x1091/0x1260
 [<c016bca2>] ? sched_clock_local+0xd2/0x170
 [<c0178fd0>] ? trace_hardirqs_off_caller+0x20/0xd0
 [<c017d5ea>] lock_acquire+0xba/0xd0
 [<c025e73b>] ? dquot_claim_space+0x3b/0x1b0
 [<c0527191>] down_read+0x51/0x90
 [<c025e73b>] ? dquot_claim_space+0x3b/0x1b0
 [<c025e73b>] dquot_claim_space+0x3b/0x1b0
 [<c02cb95f>] ext4_mb_mark_diskspace_used+0x36f/0x380
 [<c02d210a>] ext4_mb_new_blocks+0x34a/0x530
 [<c02c601d>] ? ext4_ext_find_extent+0x25d/0x280
 [<c02c83fb>] ext4_ext_get_blocks+0x122b/0x13c0
 [<c016bca2>] ? sched_clock_local+0xd2/0x170
 [<c016be60>] ? sched_clock_cpu+0x120/0x160
 [<c016beef>] ? cpu_clock+0x4f/0x60
 [<c0178fd0>] ? trace_hardirqs_off_caller+0x20/0xd0
 [<c052712c>] ? down_write+0x8c/0xa0
 [<c02a5966>] ext4_get_blocks+0x226/0x450
 [<c016be60>] ? sched_clock_cpu+0x120/0x160
 [<c016beef>] ? cpu_clock+0x4f/0x60
 [<c017908b>] ? trace_hardirqs_off+0xb/0x10
 [<c02a5ff3>] mpage_da_map_blocks+0xc3/0xaa0
 [<c01d69cc>] ? find_get_pages_tag+0x16c/0x180
 [<c01d6860>] ? find_get_pages_tag+0x0/0x180
 [<c02a73bd>] ? __mpage_da_writepage+0x16d/0x1a0
 [<c01dfc4e>] ? pagevec_lookup_tag+0x2e/0x40
 [<c01ddf1b>] ? write_cache_pages+0xdb/0x3d0
 [<c02a7250>] ? __mpage_da_writepage+0x0/0x1a0
 [<c02a6ed6>] ext4_da_writepages+0x506/0x790
 [<c016beef>] ? cpu_clock+0x4f/0x60
 [<c016bca2>] ? sched_clock_local+0xd2/0x170
 [<c016be60>] ? sched_clock_cpu+0x120/0x160
 [<c016be60>] ? sched_clock_cpu+0x120/0x160
 [<c02a69d0>] ? ext4_da_writepages+0x0/0x790
 [<c01de272>] do_writepages+0x22/0x50
 [<c01d766d>] __filemap_fdatawrite_range+0x6d/0x80
 [<c01d7b9b>] filemap_flush+0x2b/0x30
 [<c02a40ac>] ext4_alloc_da_blocks+0x5c/0x60
 [<c029e595>] ext4_release_file+0x75/0xb0
 [<c0216b59>] __fput+0xf9/0x210
 [<c0216c97>] fput+0x27/0x30
 [<c02122dc>] filp_close+0x4c/0x80
 [<c014510e>] put_files_struct+0x6e/0xd0
 [<c01451b7>] exit_files+0x47/0x60
 [<c0146a24>] do_exit+0x144/0x710
 [<c017b163>] ? lock_release_holdtime+0x33/0x210
 [<c0528137>] ? _spin_unlock_irq+0x27/0x30
 [<c0147028>] do_group_exit+0x38/0xa0
 [<c017babb>] ? trace_hardirqs_on+0xb/0x10
 [<c0159abc>] get_signal_to_deliver+0x2ac/0x410
 [<c0102849>] do_notify_resume+0xb9/0x890
 [<c0178fd0>] ? trace_hardirqs_off_caller+0x20/0xd0
 [<c017b163>] ? lock_release_holdtime+0x33/0x210
 [<c0165b50>] ? autoremove_wake_function+0x0/0x50
 [<c017ba54>] ? trace_hardirqs_on_caller+0x134/0x190
 [<c017babb>] ? trace_hardirqs_on+0xb/0x10
 [<c0300ba4>] ? security_file_permission+0x14/0x20
 [<c0215761>] ? vfs_write+0x131/0x190
 [<c0214f50>] ? do_sync_write+0x0/0x120
 [<c0103115>] ? sysenter_do_call+0x27/0x32
 [<c01032d2>] work_notifysig+0x13/0x21

CC: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
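
The reordering described at the top of this message can be sketched as a tiny userspace model. It is only an illustration under stated assumptions: the `_model` functions and the three counters are hypothetical, a flag stands in for ac->alloc_semp, and the model merely records whether a quota claim ever happens while that flag is set.

```c
#include <assert.h>

static int alloc_sem_held;     /* models holding ac->alloc_semp */
static long claimed_blocks;    /* blocks whose quota has been claimed */
static int claims_under_lock;  /* claims made while alloc_sem_held was set */

/* Models the quota claim, which takes dqptr_sem in the real code. */
static void claim_space_model(long blocks)
{
	if (alloc_sem_held)
		claims_under_lock++;  /* the ordering lockdep complained about */
	claimed_blocks += blocks;
}

/* Models block allocation: no quota claim while the semaphore is held. */
static long new_blocks_model(long want)
{
	long got;

	alloc_sem_held = 1;   /* allocation context set up */
	got = want;           /* pretend the allocator found every block */
	alloc_sem_held = 0;   /* allocation context released */
	return got;
}

/* Models the later reserve-space update, where the claim now happens. */
static void update_reserve_space_model(long used)
{
	claim_space_model(used);
}
```

Calling `new_blocks_model()` followed by `update_reserve_space_model()` claims the same number of blocks as before, but never while the allocation semaphore flag is set.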

dump_tasks() needs to hold the RCU read lock around its access of the target task's UID. To this end it should use task_uid(), as it only needs that one thing from the creds.

The fact that dump_tasks() holds tasklist_lock is insufficient to prevent the target process replacing its credentials on another CPU.

This patch therefore calls rcu_read_lock() explicitly.

===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
mm/oom_kill.c:410 invoked rcu_dereference_check() without protection!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 1
4 locks held by kworker/1:2/651:
 #0: (events){+.+.+.}, at: [<ffffffff8106aae7>] process_one_work+0x137/0x4a0
 #1: (moom_work){+.+...}, at: [<ffffffff8106aae7>] process_one_work+0x137/0x4a0
 #2: (tasklist_lock){.+.+..}, at: [<ffffffff810fafd4>] out_of_memory+0x164/0x3f0
 #3: (&(&p->alloc_lock)->rlock){+.+...}, at: [<ffffffff810fa48e>] find_lock_task_mm+0x2e/0x70

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

This patch series provides the ability for cgroup subsystems to be compiled as modules both within and outside the kernel tree. This is mainly useful for classifiers and subsystems that hook into components that are already modules. cls_cgroup and blkio-cgroup serve as the example use cases for this feature.

It provides an interface cgroup_load_subsys() and cgroup_unload_subsys() which modular subsystems can use to register and depart during runtime. The net_cls classifier subsystem serves as the example for a subsystem which can be converted into a module using these changes.

Patch #1 sets up the subsys[] array so its contents can be dynamic as modules appear and (eventually) disappear. Iterations over the array are modified to handle when subsystems are absent, and the dynamic section of the array is protected by cgroup_mutex.

Patch #2 implements an interface for modules to load subsystems, called cgroup_load_subsys, similar to cgroup_init_subsys, and adds a module pointer in struct cgroup_subsys.

Patch #3 adds a mechanism for unloading modular subsystems, which includes a more advanced rework of the rudimentary reference counting introduced in patch 2.

Patch #4 modifies the net_cls subsystem, which already had some module declarations, to be configurable as a module, which also serves as a simple proof-of-concept.

Part of implementing patches 2 and 4 involved updating css pointers in each css_set when the module appears or leaves. In doing this, it was discovered that css_sets always remain linked to the dummy cgroup, regardless of whether or not any subsystems are actually bound to it (i.e., not mounted on an actual hierarchy).
The subsystem loading and unloading code therefore should keep in mind the special cases where the added subsystem is the only one in the dummy cgroup (and therefore all css_sets need to be linked back into it) and where the removed subsys was the only one in the dummy cgroup (and therefore all css_sets should be unlinked from it) - however, as all css_sets always stay attached to the dummy cgroup anyway, these cases are ignored. Any fix that addresses this issue should also make sure these cases are addressed in the subsystem loading and unloading code. This patch: Make subsys[] able to be dynamically populated to support modular subsystems This patch reworks the way the subsys[] array is used so that subsystems can register themselves after boot time, and enables the internals of cgroups to be able to handle when subsystems are not present or may appear/disappear. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
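
The dynamically populated subsys[] array described above can be pictured with a small userspace sketch. This is not the kernel code: the names `subsys_table`, `load_subsys`, `unload_subsys`, `count_present` and the slot counts are hypothetical, and a pthread mutex stands in for cgroup_mutex. It only illustrates the idea of fixed built-in slots followed by dynamic slots that may be empty, with every iteration skipping absent entries.

```c
#include <pthread.h>
#include <stddef.h>

#define BUILTIN_COUNT 2   /* hypothetical: slots reserved for built-in subsystems */
#define SUBSYS_COUNT  6   /* hypothetical: built-in plus dynamic slots */

struct subsys {
    const char *name;
    int id;               /* slot index, assigned when the subsystem is loaded */
};

/* Dynamic slots start out NULL and are filled as "modules" register. */
static struct subsys *subsys_table[SUBSYS_COUNT];
/* Stand-in for cgroup_mutex: protects the dynamic part of the table. */
static pthread_mutex_t table_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Register ss in the first free dynamic slot. Returns 0, or -1 if full. */
int load_subsys(struct subsys *ss)
{
    int ret = -1;

    pthread_mutex_lock(&table_mutex);
    for (int i = BUILTIN_COUNT; i < SUBSYS_COUNT; i++) {
        if (subsys_table[i] == NULL) {
            ss->id = i;
            subsys_table[i] = ss;
            ret = 0;
            break;
        }
    }
    pthread_mutex_unlock(&table_mutex);
    return ret;
}

/* Remove ss from its dynamic slot, if it is still registered there. */
void unload_subsys(struct subsys *ss)
{
    pthread_mutex_lock(&table_mutex);
    if (ss->id >= BUILTIN_COUNT && ss->id < SUBSYS_COUNT &&
        subsys_table[ss->id] == ss)
        subsys_table[ss->id] = NULL;
    pthread_mutex_unlock(&table_mutex);
}

/* Iterate over the table, tolerating slots whose subsystem is absent. */
int count_present(void)
{
    int n = 0;

    pthread_mutex_lock(&table_mutex);
    for (int i = 0; i < SUBSYS_COUNT; i++) {
        if (subsys_table[i] == NULL)
            continue;     /* subsystem not loaded: skip the slot */
        n++;
    }
    pthread_mutex_unlock(&table_mutex);
    return n;
}
```

In this sketch a caller would define a `struct subsys`, pass it to `load_subsys()` at module init and to `unload_subsys()` at module exit. The real series additionally handles module reference counting and css_set updates, which are left out here.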

espaciosalter20 added 16 commits October 5, 2012 17:23

Update drivers/cpufreq/Kconfig

ddc857d

added some new governors

Update drivers/cpufreq/Makefile

8536eb5

Added new governors

added new governors

90e3674

changes in netfilter

325c76e

added ZRAM support, new governors, sio scheduler, clock fixes

c963dc8

Update drivers/cpufreq/cpufreq_brazilianwax.c

e406834

Update drivers/cpufreq/cpufreq_savagedzen.c

e18fb61

Update drivers/cpufreq/cpufreq_smartass.c

4f34630

Update drivers/cpufreq/cpufreq_smartass2.c

5ff5ea9

Update drivers/cpufreq/cpufreq_smoothass.c

dade837

Update syntax on defconfig

f649381

Update drivers/misc/Kconfig

61f6743

Update arch/arm/plat-omap/Kconfig

8f7b38a

Update drivers/cpufreq/Kconfig

929c2ed

Update drivers/staging/Kconfig

c6f5f28

Update drivers/cpufreq/Kconfig

2b89b5a

Quarx2k added a commit that referenced this pull request Feb 3, 2013

cleanup usb #1

83e8fc5

Quarx2k closed this Feb 8, 2013

Quarx2k added a commit that referenced this pull request Feb 10, 2013

cleanup #1

c75a8da
