sched: Have do_idle() call __schedule() without enabling preemption
Peter, I hope Thomas didn't lend you his frozen shark thrower. ;-) Let me explain the issue we have.

A while back, I asked Paul if he could add a special RCU case, where the quiescent states are when a task goes into user land (it's fine if it's there), is sleeping (off the run queue), or voluntarily schedules. He implemented this for me: after calling synchronize_rcu_tasks(), all tasks are either in a quiescent state (in user space or not on a run queue) or have done a voluntary schedule (called __schedule(false)). The one issue is that this ignores the idle task, as the idle task is never in user space and is always on the run queue. The only thing we could hope for is that it calls __schedule(false). Thus synchronize_rcu_tasks() ignores it.

The reason I need this is to allow ftrace hooks that are created dynamically (like what perf does when it uses function tracing) to use the optimized dynamic trampolines. What that means is, if there's a single callback for a function hook, a trampoline is created dynamically and that hook (fentry) calls the dynamic trampoline directly. That dynamic trampoline calls the function callback directly, without caring about any other function hook that may be registered to other functions. If perf does function tracing and no other code is doing any function tracing, then all the functions it traces will call this dynamically allocated trampoline, which will call the perf function tracer directly. Otherwise they call the default trampoline, which calls a loop function to iterate over any function callbacks registered with ftrace.

The issue occurs when perf is done tracing. Now we need to free this dynamically created trampoline, but in order to do that, we need to make sure no tasks are executing on it. One thing ftrace does, after detaching the hook from the functions and before freeing the trampoline, is to schedule on every CPU, which makes sure everything is out of a preempt section.
This is more severe than synchronize_sched(), because ftrace is recorded in locations where RCU is not active (like going into idle), so synchronize_sched() is not good enough. But even scheduling on all CPUs is not good enough when the kernel itself is preemptible. In that case we use synchronize_rcu_tasks(), which covers the cases where a task has been preempted. Except for one! This brings us back to synchronize_rcu_tasks() ignoring the idle case.

Here is where I trigger this:

 BUG: unable to handle kernel paging request at ffffffffa0230077
 IP: 0xffffffffa0230077
 PGD 2414067 PUD 2415063 PMD c463c067 PTE 0
 Oops: 0010 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
 Modules linked in: ip6table_filter ip6_tables snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core x86_pkg_temp_thermal kvm_intel i915 snd_seq kvm snd_seq_device snd_pcm i2c_algo_bit snd_timer drm_kms_helper irqbypass syscopyarea sysfillrect sysimgblt fb_sys_fops drm i2c_i801 snd soundcore wmi i2c_core video e1000e ptp pps_core
 CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.11.0-rc3-test+ #356
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
 task: ffff8800cebbd0c0 task.stack: ffff8800cebd8000
 RIP: 0010:0xffffffffa0230077
 RSP: 0018:ffff8800cebdfd80 EFLAGS: 00010286
 RAX: 0000000000000000 RBX: ffff8800cebbd0c0 RCX: ffffffff8121391e
 RDX: dffffc0000000000 RSI: ffffffff81ce7c78 RDI: ffff8800cebbf298
 RBP: ffff8800cebdfe28 R08: 0000000000000003 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800cebbd0c0
 R13: ffffffff828fbe90 R14: ffff8800d392c480 R15: ffffffff826cc380
 FS:  0000000000000000(0000) GS:ffff8800d3900000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: ffffffffa0230077 CR3: 0000000002413000 CR4: 00000000001406e0
 Call Trace:
  ? debug_smp_processor_id+0x17/0x20
  ? sched_ttwu_pending+0x79/0x190
  ? schedule+0x5/0xe0
  ? trace_hardirqs_on_caller+0x182/0x280
  schedule+0x5/0xe0
  schedule_preempt_disabled+0x18/0x30
  ? schedule+0x5/0xe0
  ? schedule_preempt_disabled+0x18/0x30
  do_idle+0x172/0x220

That RIP of ffffffffa0230077 is on the trampoline. The return address is schedule+0x5, which happens to be from a call at fentry. Notice the "Comm: swapper/2". I can trigger this with a few runs of my test case, pretty consistently.

When I looked at the code, I didn't see any reason for idle to ever enable preemption. It currently calls schedule_preempt_disabled(), which enables preemption, calls schedule(), and then disables preemption again. The schedule() calls sched_submit_work(), which immediately returns because of the check for whether the task is running, and idle is always running. Then it loops: disabling preemption, calling __schedule(), enabling preemption again, and checking need_resched().

The above bug happens in schedule_preempt_disabled(), where preemption is enabled and then schedule() is called, which is traced by ftrace. An interrupt came in while the idle task was on the ftrace trampoline, and when the interrupt returned, it scheduled via the interrupt's preemption scheduling and not via the call to schedule() itself, which I feel is inefficient anyway, as idle is going to call schedule() again regardless. The trampoline was freed (because none of the synchronizations were able to detect that idle was preempted on a trampoline), and when idle got to run again, it crashed: it was executing code that no longer exists.

The solution I'm proposing here is to create a schedule_idle() call, local to kernel/sched/, that do_idle() calls instead. It does basically the same thing that schedule() does, except that it doesn't enable preemption: it calls __schedule() in a loop that checks need_resched().
Not only does this solve the bug I'm triggering when trying to free dynamically created ftrace trampolines, it also makes the idle code a bit more efficient by removing the spurious schedule calls that occur when an interrupt arrives in the small window where preemption is enabled.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>