Commits on Apr 19, 2021

  1. sched,fair: skip newidle_balance if a wakeup is pending

    The try_to_wake_up function has an optimization where it can queue
    a task for wakeup on its previous CPU, if the task is still in the
    middle of going to sleep inside schedule().
    
    Once schedule() re-enables IRQs, the task will be woken up with an
    IPI, and placed back on the runqueue.
    
    If we have such a wakeup pending, there is no need to search other
    CPUs for runnable tasks. Just skip (or bail out early from) newidle
    balancing, and run the just woken up task.
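The gist of the bail-out can be sketched in user-space C (a toy model, not the kernel code; `ttwu_pending` and the stand-in work counter are illustrative names):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy runqueue (user-space model, not the kernel struct): ttwu_pending
 * is set when try_to_wake_up() has queued a remote wakeup for this CPU. */
struct rq {
    bool ttwu_pending;
    int  pulled_if_balanced;  /* stand-in for the real balancing work */
};

/* Sketch of the early bail-out: a pending wakeup means a runnable task
 * is about to appear locally, so searching other CPUs is pointless. */
static int newidle_balance(struct rq *rq)
{
    if (rq->ttwu_pending)
        return 0;                     /* skip the search entirely */
    return rq->pulled_if_balanced;    /* otherwise do the normal scan */
}
```

The check costs one load on a hot path, which is why the measured 2% CPU saving comes almost entirely from the avoided cross-CPU scans.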
    
    For a memcache-like workload test, this reduces total CPU use by
    about 2%, proportionally split between user and system time, and
    reduces p99 and p95 application response times by 2-3% on average.
    The schedstats run_delay number shows a similar improvement.
    
    Signed-off-by: Rik van Riel <riel@surriel.com>
    rikvanriel authored and intel-lab-lkp committed Apr 19, 2021

Commits on Apr 17, 2021

  1. sched/debug: Rename the sched_debug parameter to sched_verbose

    CONFIG_SCHED_DEBUG is the build-time Kconfig knob; the boot param
    sched_debug and the /debug/sched/debug_enabled knob control the
    sched_debug_enabled variable. What they really do, though, is make
    SCHED_DEBUG more verbose, so rename the lot.
    
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Peter Zijlstra committed Apr 17, 2021

Commits on Apr 16, 2021

  1. sched,fair: Alternative sched_slice()

    The current sched_slice() seems to have issues; there are two
    things that could be improved:
    
     - the 'nr_running' used for __sched_period() is daft when cgroups are
       considered. Using the RQ-wide h_nr_running seems like a much more
       consistent number.
    
     - (especially) cgroups can slice runtime really fine, which makes for
       easy over-scheduling; ensure min_gran is the floor its name implies.
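A toy model of the two changes, assuming roughly the default tunables (the constants, the explicit weight parameters, and the names are illustrative, not the kernel's exact implementation):

```c
#include <assert.h>

/* Illustrative defaults in nanoseconds; not the live sysctl values. */
#define SCHED_LATENCY_NS   6000000ULL  /* 6 ms    */
#define SCHED_MIN_GRAN_NS   750000ULL  /* 0.75 ms */
#define SCHED_NR_LATENCY          8ULL /* latency / min_gran */

/* Period grows once there are more runnable tasks than fit in the
 * latency target. */
static unsigned long long sched_period(unsigned long long nr_running)
{
    if (nr_running > SCHED_NR_LATENCY)
        return nr_running * SCHED_MIN_GRAN_NS;
    return SCHED_LATENCY_NS;
}

/* Sketch of the two changes: size the period by the RQ-wide
 * h_nr_running, and clamp the weighted slice so min_gran really is a
 * minimum -- low-weight entities in deeply nested cgroups can otherwise
 * slice time arbitrarily fine. */
static unsigned long long sched_slice(unsigned long long h_nr_running,
                                      unsigned long long weight,
                                      unsigned long long total_weight)
{
    unsigned long long slice =
        sched_period(h_nr_running) * weight / total_weight;

    if (slice < SCHED_MIN_GRAN_NS)
        slice = SCHED_MIN_GRAN_NS;
    return slice;
}
```

With equal weights the clamp never fires; it matters exactly in the cgroup case, where a tiny weight share would otherwise produce a microscopic slice.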
    
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Valentin Schneider <valentin.schneider@arm.com>
    Link: https://lkml.kernel.org/r/20210412102001.611897312@infradead.org
    Peter Zijlstra committed Apr 16, 2021
  2. sched: Move /proc/sched_debug to debugfs

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Tested-by: Valentin Schneider <valentin.schneider@arm.com>
    Link: https://lkml.kernel.org/r/20210412102001.548833671@infradead.org
    Peter Zijlstra committed Apr 16, 2021
  3. sched,debug: Convert sysctl sched_domains to debugfs

    Stop polluting sysctl, move to debugfs for SCHED_DEBUG stuff.
    
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
    Tested-by: Valentin Schneider <valentin.schneider@arm.com>
    Link: https://lkml.kernel.org/r/YHgB/s4KCBQ1ifdm@hirez.programming.kicks-ass.net
    Peter Zijlstra committed Apr 16, 2021
  4. debugfs: Implement debugfs_create_str()

    Implement debugfs_create_str() to easily display names and such in
    debugfs.
    
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Tested-by: Valentin Schneider <valentin.schneider@arm.com>
    Link: https://lkml.kernel.org/r/20210412102001.415407080@infradead.org
    Peter Zijlstra committed Apr 16, 2021
  5. sched,preempt: Move preempt_dynamic to debug.c

    Move the #ifdef SCHED_DEBUG bits to kernel/sched/debug.c in order to
    collect all the debugfs bits.
    
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Valentin Schneider <valentin.schneider@arm.com>
    Link: https://lkml.kernel.org/r/20210412102001.353833279@infradead.org
    Peter Zijlstra committed Apr 16, 2021
  6. sched: Move SCHED_DEBUG sysctl to debugfs

    Stop polluting sysctl with undocumented knobs that really are debug
    only, move them all to /debug/sched/ along with the existing
    /debug/sched_* files that already exist.
    
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Tested-by: Valentin Schneider <valentin.schneider@arm.com>
    Link: https://lkml.kernel.org/r/20210412102001.287610138@infradead.org
    Peter Zijlstra committed Apr 16, 2021
  7. sched: Don't make LATENCYTOP select SCHED_DEBUG

    SCHED_DEBUG is not in fact required for LATENCYTOP, don't select it.
    
    Suggested-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Valentin Schneider <valentin.schneider@arm.com>
    Link: https://lkml.kernel.org/r/20210412102001.224578981@infradead.org
    Peter Zijlstra committed Apr 16, 2021
  8. sched: Remove sched_schedstats sysctl out from under SCHED_DEBUG

    CONFIG_SCHEDSTATS does not depend on SCHED_DEBUG, it is inconsistent
    to have the sysctl depend on it.
    
    Suggested-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Valentin Schneider <valentin.schneider@arm.com>
    Link: https://lkml.kernel.org/r/20210412102001.161151631@infradead.org
    Peter Zijlstra committed Apr 16, 2021
  9. sched/numa: Allow runtime enabling/disabling of NUMA balance without SCHED_DEBUG
    
    The ability to enable/disable NUMA balancing is not a debugging feature
    and should not depend on CONFIG_SCHED_DEBUG.  For example, machines within
    a HPC cluster may disable NUMA balancing temporarily for some jobs and
    re-enable it for other jobs without needing to reboot.
    
    This patch removes the dependency on CONFIG_SCHED_DEBUG for
    kernel.numa_balancing sysctl. The other numa balancing related sysctls
    are left as-is because if they need to be tuned then it is more likely
    that NUMA balancing needs to be fixed instead.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Valentin Schneider <valentin.schneider@arm.com>
    Link: https://lkml.kernel.org/r/20210324133916.GQ15768@suse.de
    Mel Gorman authored and Peter Zijlstra committed Apr 16, 2021
  10. sched: Use cpu_dying() to fix balance_push vs hotplug-rollback

    Use the new cpu_dying() state to simplify and fix the balance_push()
    vs CPU hotplug rollback state.
    
    Specifically, we currently rely on the sched_cpu_dying() /
    sched_cpu_activate() notifiers to terminate balance_push; however, if
    cpu_down() fails when we're past sched_cpu_deactivate(), it should
    terminate balance_push at that point and not wait until we hit
    sched_cpu_activate().
    
    Similarly, when cpu_up() fails and we're going back down, balance_push
    should be active, where it currently is not.
    
    So instead, make sure balance_push is enabled below SCHED_AP_ACTIVE
    (when !cpu_active()), and gate its utility with cpu_dying().
    
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
    Link: https://lkml.kernel.org/r/YHgAYef83VQhKdC2@hirez.programming.kicks-ass.net
    Peter Zijlstra committed Apr 16, 2021
  11. cpumask: Introduce DYING mask

    Introduce a cpumask that indicates (for each CPU) what direction the
    CPU hotplug is currently going. Notably, it tracks rollbacks. E.g. when
    an up fails and we do a roll-back down, it will accurately reflect the
    direction.
    
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
    Link: https://lkml.kernel.org/r/20210310150109.151441252@infradead.org
    Peter Zijlstra committed Apr 16, 2021
  12. cpumask: Make cpu_{online,possible,present,active}() inline

    Prepare for the addition of another mask. Primarily a code movement to
    avoid having to create more #ifdefs, but while there, convert
    everything with an argument to an inline function.
    
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
    Link: https://lkml.kernel.org/r/20210310150109.045447765@infradead.org
    Peter Zijlstra committed Apr 16, 2021

Commits on Apr 14, 2021

  1. rseq: Optimise rseq_get_rseq_cs() and clear_rseq_cs()

    Commit ec9c82e ("rseq: uapi: Declare rseq_cs field as union,
    update includes") added regressions for our servers.
    
    Using copy_from_user() and clear_user() for 64bit values
    is suboptimal.
    
    We can use faster put_user() and get_user() on 64bit arches.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Link: https://lkml.kernel.org/r/20210413203352.71350-4-eric.dumazet@gmail.com
    neebe000 authored and Peter Zijlstra committed Apr 14, 2021
  2. rseq: Remove redundant access_ok()

    After commit 8f28177 ("rseq: Use get_user/put_user rather
    than __get_user/__put_user") we no longer need
    an access_ok() call from __rseq_handle_notify_resume()
    
    Mathieu pointed out the same cleanup can be done
    in rseq_syscall().
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Link: https://lkml.kernel.org/r/20210413203352.71350-3-eric.dumazet@gmail.com
    neebe000 authored and Peter Zijlstra committed Apr 14, 2021
  3. rseq: Optimize rseq_update_cpu_id()

    Two put_user() in rseq_update_cpu_id() are replaced
    by a pair of unsafe_put_user() with appropriate surroundings.
    
    This removes one stac/clac pair on x86 in fast path.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Link: https://lkml.kernel.org/r/20210413203352.71350-2-eric.dumazet@gmail.com
    neebe000 authored and Peter Zijlstra committed Apr 14, 2021
  4. signal: Allow tasks to cache one sigqueue struct

    The idea for this originates from the real time tree to make signal
    delivery for realtime applications more efficient. In quite a few of
    these application scenarios a control task signals workers to start
    their computations. There is usually only one signal per worker in
    flight. This works nicely as long as the kmem cache allocations do
    not hit the slow path and cause latencies.
    
    To cure this an optimistic caching was introduced (limited to RT tasks)
    which allows a task to cache a single sigqueue in a pointer in task_struct
    instead of handing it back to the kmem cache after consuming a signal. When
    the next signal is sent to the task then the cached sigqueue is used
    instead of allocating a new one. This solved the problem for this set of
    application scenarios nicely.
    
    The task cache is not preallocated, so the first signal sent to a task
    always goes to the kmem cache allocator. The cached sigqueue stays
    around until the task exits and is freed when task::sighand is dropped.
    
    After posting this solution for mainline the discussion came up whether
    this would be useful in general and should not be limited to realtime
    tasks: https://lore.kernel.org/r/m11rcu7nbr.fsf@fess.ebiederm.org
    
    One concern leading to the original limitation was to avoid a large amount
    of pointlessly cached sigqueues in alive tasks. The other concern was
    vs. RLIMIT_SIGPENDING as these cached sigqueues are not accounted for.
    
    The accounting problem is real, but on the other hand slightly academic.
    After gathering some statistics it turned out that after boot of a regular
    distro install there are fewer than 10 sigqueues cached in ~1500 tasks.
    
    In case of a 'mass fork and fire signal to child' scenario the extra 80
    bytes of memory per task are well in the noise of the overall memory
    consumption of the fork bomb.
    
    If this should be limited then this would need an extra counter in struct
    user, more atomic instructions and a separate rlimit. Yet another tunable
    which is mostly unused.
    
    The caching is actually used. After boot and a full kernel compile on a
    64CPU machine with make -j128 the number of 'allocations' looks like this:
    
      From slab:	   23996
      From task cache: 52223
    
    I.e. it reduces the number of slab cache operations by ~68%.
    
    A typical pattern there is:
    
    <...>-58490 __sigqueue_alloc:  for 58488 from slab ffff8881132df460
    <...>-58488 __sigqueue_free:   cache ffff8881132df460
    <...>-58488 __sigqueue_alloc:  for 1149 from cache ffff8881103dc550
      bash-1149 exit_task_sighand: free ffff8881132df460
      bash-1149 __sigqueue_free:   cache ffff8881103dc550
    
    The interesting sequence is that the exiting task 58488 grabs the sigqueue
    from bash's task cache to signal exit and bash sticks it back into its own
    cache. Lather, rinse and repeat.
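The single-slot caching pattern can be sketched as a lock-free user-space analog (hypothetical names; the kernel version sits behind sighand locking and differs in detail):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* User-space analog of the one-entry sigqueue cache: a task caches at
 * most one free node.  Alloc tries the cache first; free tries to park
 * the node back.  atomic exchange/CAS keep both paths race-safe. */
struct sigqueue { int data; };

struct task {
    _Atomic(struct sigqueue *) sigqueue_cache;  /* at most one node */
};

static struct sigqueue *sigqueue_alloc(struct task *t)
{
    struct sigqueue *q = atomic_exchange(&t->sigqueue_cache, NULL);

    if (q)
        return q;                           /* fast path: reuse cached node */
    return malloc(sizeof(struct sigqueue)); /* slow path: hit the allocator */
}

static void sigqueue_free(struct task *t, struct sigqueue *q)
{
    struct sigqueue *expected = NULL;

    /* Park the node if the slot is empty, otherwise really free it. */
    if (!atomic_compare_exchange_strong(&t->sigqueue_cache, &expected, q))
        free(q);
}
```

The slot holds one node at most, which is what bounds the memory cost to those ~80 bytes per task mentioned above.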
    
    The caching is probably not noticeable for the general use case, but the
    benefit for latency sensitive applications is clear. While kmem caches
    usually serve from the fast path, slab merging (the default) can,
    depending on the usage pattern of the merged slabs, cause occasional
    slow path allocations.
    
    The time spared per cached entry is a few microseconds per signal, which
    is not relevant for e.g. a kernel build, but for signal heavy workloads
    it's measurable.
    
    As there is no real downside of this caching mechanism making it
    unconditionally available is preferred over more conditional code or new
    magic tunables.
    
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Oleg Nesterov <oleg@redhat.com>
    Link: https://lkml.kernel.org/r/87sg4lbmxo.fsf@nanos.tec.linutronix.de
    Thomas Gleixner authored and Peter Zijlstra committed Apr 14, 2021
  5. signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc()

    There is no point in having the conditional at the callsite.
    
    Just hand in the allocation mode flag to __sigqueue_alloc() and use it to
    initialize sigqueue::flags.
    
    No functional change.
    
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20210322092258.898677147@linutronix.de
    Thomas Gleixner authored and Peter Zijlstra committed Apr 14, 2021

Commits on Apr 9, 2021

  1. sched/fair: Introduce a CPU capacity comparison helper

    During load-balance, groups classified as group_misfit_task are filtered
    out if they do not pass
    
      group_smaller_max_cpu_capacity(<candidate group>, <local group>);
    
    which itself employs fits_capacity() to compare the sgc->max_capacity of
    both groups.
    
    Due to the underlying margin, fits_capacity(X, 1024) will return false for
    any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
    {261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
    CPUs, misfit migration will never intentionally upmigrate it to a CPU of
    higher capacity due to the aforementioned margin.
    
    One may argue the 20% margin of fits_capacity() is excessive with the
    advent of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
    is that fits_capacity() is meant to compare a utilization value to a
    capacity value, whereas here it is being used to compare two capacity
    values. As CPU capacity and task utilization have different dynamics, a
    sensible approach here would be to add a new helper dedicated to comparing
    CPU capacities.
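The 819 cut-off follows directly from fits_capacity()'s 20% margin (cap * 1280 < max * 1024). A small demonstration, with the new capacity-vs-capacity helper sketched as a plain margin-free comparison (the kernel's actual capacity_greater() may carry a small margin of its own):

```c
#include <assert.h>
#include <stdbool.h>

/* fits_capacity() as used in fair.c: does utilization 'cap' fit within
 * 'max' with 20% headroom (1280/1024 = 1.25)? */
static bool fits_capacity(unsigned long cap, unsigned long max)
{
    return cap * 1280 < max * 1024;
}

/* A dedicated capacity-vs-capacity helper, sketched here as a plain
 * comparison; comparing two capacities needs no utilization headroom. */
static bool capacity_greater(unsigned long cap1, unsigned long cap2)
{
    return cap1 > cap2;
}
```

Since 820 * 1280 = 1049600 > 1024 * 1024 = 1048576, any capacity above 819 "does not fit" into 1024 under the margin, which is exactly why the Pixel 4's 871-capacity CPUs were never treated as upmigration targets.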
    
    Also note that comparing capacity extrema of local and source sched_group's
    doesn't make much sense when, at the end of the day, the imbalance will be
    pulled by a known env->dst_cpu, whose capacity can be anywhere within the
    local group's capacity extrema.
    
    While at it, replace group_smaller_{min, max}_cpu_capacity() with
    comparisons of the source group's min/max capacity and the destination
    CPU's capacity.
    
    Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Reviewed-by: Qais Yousef <qais.yousef@arm.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Tested-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
    Link: https://lkml.kernel.org/r/20210407220628.3798191-4-valentin.schneider@arm.com
    valschneider authored and Peter Zijlstra committed Apr 9, 2021
  2. sched/fair: Clean up active balance nr_balance_failed trickery

    When triggering an active load balance, sd->nr_balance_failed is set to
    such a value that any further can_migrate_task() using said sd will ignore
    the output of task_hot().
    
    This behaviour makes sense, as active load balance intentionally preempts a
    rq's running task to migrate it right away, but this asynchronous write is
    a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
    before the sd->nr_balance_failed write either becomes visible to the
    stopper's CPU or even happens on the CPU that appended the stopper work.
    
    Add a struct lb_env flag to denote active balancing, and use it in
    can_migrate_task(). Remove the sd->nr_balance_failed write that served the
    same purpose. Clean up the LBF_DST_PINNED active balance special case.
    
    Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lkml.kernel.org/r/20210407220628.3798191-3-valentin.schneider@arm.com
    valschneider authored and Peter Zijlstra committed Apr 9, 2021
  3. sched/fair: Ignore percpu threads for imbalance pulls

    During load balance, LBF_SOME_PINNED will be set if any candidate task
    cannot be detached due to CPU affinity constraints. This can result in
    setting env->sd->parent->sgc->group_imbalance, which can lead to a group
    being classified as group_imbalanced (rather than any of the other, lower
    group_type) when balancing at a higher level.
    
    In workloads involving a single task per CPU, LBF_SOME_PINNED can often be
    set due to per-CPU kthreads being the only other runnable tasks on any
    given rq. This results in changing the group classification during
    load-balance at higher levels when in reality there is nothing that can be
    done for this affinity constraint: per-CPU kthreads, as the name implies,
    don't get to move around (modulo hotplug shenanigans).
    
    It's not as clear for userspace tasks - a task could be in an N-CPU cpuset
    with N-1 offline CPUs, making it an "accidental" per-CPU task rather than
    an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which
    we can leverage here to not set LBF_SOME_PINNED.
    
    Note that the aforementioned classification to group_imbalance (when
    nothing can be done) is especially problematic on big.LITTLE systems, which
    have a topology the likes of:
    
      DIE [          ]
      MC  [    ][    ]
           0  1  2  3
           L  L  B  B
    
      arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)
    
    Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC
    level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying
    the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if
    CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads
    can significantly delay the upmigration of said misfit tasks. Systems
    relying on ASYM_PACKING are likely to face similar issues.
    
    Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
    [Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
    [Reword changelog]
    Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lkml.kernel.org/r/20210407220628.3798191-2-valentin.schneider@arm.com
    Lingutla Chandrasekhar authored and Peter Zijlstra committed Apr 9, 2021
  4. sched/fair: Bring back select_idle_smt(), but differently

    Mel Gorman did some nice work in 9fe1f12 ("sched/fair: Merge
    select_idle_core/cpu()"), resulting in the kernel being more efficient
    at finding an idle CPU, and in tasks spending less time waiting to be
    run, both according to the schedstats run_delay numbers, and according
    to measured application latencies. Yay.
    
    The flip side of this is that we see more task migrations (about 30%
    more), higher cache misses, higher memory bandwidth utilization, and
    higher CPU use, for the same number of requests/second.
    
    This is most pronounced on a memcache type workload, which saw a
    consistent 1-3% increase in total CPU use on the system, due to those
    increased task migrations leading to higher L2 cache miss numbers, and
    higher memory utilization. The exclusive L3 cache on Skylake does us
    no favors there.
    
    On our web serving workload, that effect is usually negligible.
    
    It appears that the increased number of CPU migrations is generally a
    good thing, since it leads to lower cpu_delay numbers, reflecting the
    fact that tasks get to run faster. However, the reduced locality and
    the corresponding increase in L2 cache misses hurts a little.
    
    The patch below appears to fix the regression, while keeping the
    benefit of the lower cpu_delay numbers, by reintroducing
    select_idle_smt with a twist: when a socket has no idle cores, check
    to see if the sibling of "prev" is idle, before searching all the
    other CPUs.
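The "twist" can be sketched as a toy CPU picker (hypothetical topology where cpu ^ 1 is the SMT sibling; the real code works on cpumasks within the LLC domain):

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

/* Toy topology: cpu ^ 1 is the SMT sibling. */
static int smt_sibling(int cpu)
{
    return cpu ^ 1;
}

/* Sketch of the reintroduced select_idle_smt() behaviour: when no idle
 * core is available, probe prev's sibling before paying for a full scan. */
static int pick_idle_cpu(const bool idle[NR_CPUS], int prev)
{
    int sib = smt_sibling(prev);

    if (idle[sib])
        return sib;                          /* cheap, cache-warm pick */

    for (int cpu = 0; cpu < NR_CPUS; cpu++)  /* expensive full scan */
        if (idle[cpu])
            return cpu;

    return prev;                             /* nothing idle: stay put */
}
```

The sibling shares the L1/L2 with prev, so landing there preserves cache locality that an arbitrary idle CPU would not.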
    
    This fixes both the occasional 9% regression on the web serving
    workload, and the continuous 2% CPU use regression on the memcache
    type workload.
    
    With Mel's patches and this patch together, task migrations are still
    high, but L2 cache misses, memory bandwidth, and CPU time used are
    back down to what they were before. The p95 and p99 response times for
    the memcache type application improve by about 10% over what they were
    before Mel's patches got merged.
    
    Signed-off-by: Rik van Riel <riel@surriel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lkml.kernel.org/r/20210326151932.2c187840@imladris.surriel.com
    rikvanriel authored and Peter Zijlstra committed Apr 9, 2021

Commits on Apr 8, 2021

  1. psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files

    Currently only root can write files under /proc/pressure. Relax this to
    allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be
    able to write to these files.
    
    Signed-off-by: Josh Hunt <johunt@akamai.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Link: https://lkml.kernel.org/r/20210402025833.27599-1-johunt@akamai.com
    geauxbears authored and Peter Zijlstra committed Apr 8, 2021

Commits on Mar 25, 2021

  1. sched/topology: Remove redundant cpumask_and() in init_overlap_sched_group()
    
    mask is built in build_balance_mask() by for_each_cpu(i, sg_span), so
    it must be a subset of sched_group_span(sg).
    
    So the cpumask_and() call is redundant - remove it.
    
    [ mingo: Adjusted the changelog a bit. ]
    
    Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
    Link: https://lore.kernel.org/r/20210325023140.23456-1-song.bao.hua@hisilicon.com
    Barry Song authored and Ingo Molnar committed Mar 25, 2021
  2. sched/core: Use -EINVAL in sched_dynamic_mode()

    -1 is -EPERM which is a somewhat odd error to return from
    sched_dynamic_write(). No other callers care about which negative
    value is used.
    
    Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Link: https://lore.kernel.org/r/20210325004515.531631-2-linux@rasmusvillemoes.dk
    Villemoes authored and Ingo Molnar committed Mar 25, 2021
  3. sched/core: Stop using magic values in sched_dynamic_mode()

    Use the enum names which are also what is used in the switch() in
    sched_dynamic_update().
    
    Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Link: https://lore.kernel.org/r/20210325004515.531631-1-linux@rasmusvillemoes.dk
    Villemoes authored and Ingo Molnar committed Mar 25, 2021

Commits on Mar 23, 2021

  1. sched/fair: Reduce long-tail newly idle balance cost

    A long-tail load balance cost is observed on the newly idle path.
    It is caused by a race window between the first nr_running check
    of the busiest runqueue and its nr_running recheck in detach_tasks.
    
    Before the busiest runqueue is locked, the tasks on the busiest
    runqueue could be pulled by other CPUs, and nr_running of the busiest
    runqueue can drop to 1 or even 0 if the running task becomes idle. This
    causes detach_tasks to break out with the LBF_ALL_PINNED flag set, and
    triggers a load_balance redo at the same sched_domain level.
    
    In order to find the new busiest sched_group and CPU, load balance will
    recompute and update the various load statistics, which eventually leads
    to the long-tail load balance cost.
    
    This patch clears LBF_ALL_PINNED flag for this race condition, and hence
    reduces the long-tail cost of newly idle balance.
    
    Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lkml.kernel.org/r/1614154549-116078-1-git-send-email-aubrey.li@intel.com
    aubreyli authored and Peter Zijlstra committed Mar 23, 2021
  2. sched/fair: Optimize test_idle_cores() for !SMT

    update_idle_core() is only done in the sched_smt_present case, but
    test_idle_cores() is done for all machines, even those without SMT.
    
    This can contribute up to an 8%+ hackbench performance loss on a
    machine like Kunpeng 920, which has no SMT. This patch removes the
    redundant test_idle_cores() for !SMT machines.
    
    Hackbench is run with -g {2..14}; for each g it is run 10 times to
    get an average.
    
      $ numactl -N 0 hackbench -p -T -l 20000 -g $1
    
    The below is the result of hackbench w/ and w/o this patch:
    
      g=    2      4     6       8      10     12      14
      w/o: 1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929
      w/ : 1.8428 3.7436 5.4501 6.9522 8.2882  9.9535 11.3367
    			    +4.1%  +8.3%  +7.3%   +6.3%
    
    Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Link: https://lkml.kernel.org/r/20210320221432.924-1-song.bao.hua@hisilicon.com
    Barry Song authored and Peter Zijlstra committed Mar 23, 2021
  3. psi: Reduce calls to sched_clock() in psi

    We noticed that the cost of psi increases with the increase in the
    levels of the cgroups. Particularly the cost of cpu_clock() sticks out
    as the kernel calls it multiple times as it traverses up the cgroup
    tree. This patch reduces the calls to cpu_clock().
    
    Performed perf bench on Intel Broadwell with 3 levels of cgroup.
    
    Before the patch:
    
    $ perf bench sched all
     # Running sched/messaging benchmark...
     # 20 sender and receiver processes per group
     # 10 groups == 400 processes run
    
         Total time: 0.747 [sec]
    
     # Running sched/pipe benchmark...
     # Executed 1000000 pipe operations between two processes
    
         Total time: 3.516 [sec]
    
           3.516689 usecs/op
             284358 ops/sec
    
    After the patch:
    
    $ perf bench sched all
     # Running sched/messaging benchmark...
     # 20 sender and receiver processes per group
     # 10 groups == 400 processes run
    
         Total time: 0.640 [sec]
    
     # Running sched/pipe benchmark...
     # Executed 1000000 pipe operations between two processes
    
         Total time: 3.329 [sec]
    
           3.329820 usecs/op
             300316 ops/sec
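The shape of the optimization, as a sketch: read the clock once at the leaf and pass the timestamp up the cgroup walk (names and structure here are illustrative, not the psi code itself):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in clock so the number of reads is observable. */
static int clock_reads;
static uint64_t fake_sched_clock(void)
{
    clock_reads++;
    return 1000;
}

struct psi_group {
    struct psi_group *parent;
    uint64_t last_update;
};

/* Before: one clock read per cgroup level while walking up the tree. */
static void record_times_per_level(struct psi_group *g)
{
    for (; g; g = g->parent)
        g->last_update = fake_sched_clock();
}

/* After (the commit's idea, sketched): read the clock once at the leaf
 * and hand the same timestamp up the walk. */
static void record_times_once(struct psi_group *g)
{
    uint64_t now = fake_sched_clock();

    for (; g; g = g->parent)
        g->last_update = now;
}
```

The saving scales with cgroup depth, which matches the observation that psi cost grows with the number of levels.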
    
    Signed-off-by: Shakeel Butt <shakeelb@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Link: https://lkml.kernel.org/r/20210321205156.4186483-1-shakeelb@google.com
    shakeelb authored and Peter Zijlstra committed Mar 23, 2021
  4. stop_machine: Add caller debug info to queue_stop_cpus_work

    Most callsites were covered by commit
    
      a8b62fd ("stop_machine: Add function and caller debug info")
    
    but this skipped queue_stop_cpus_work(). Add caller debug info to it.
    
    Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20201210163830.21514-2-valentin.schneider@arm.com
    valschneider authored and Peter Zijlstra committed Mar 23, 2021

Commits on Mar 21, 2021

  1. sched: Fix various typos

    Fix ~42 single-word typos in scheduler code comments.
    
    We have accumulated a few fun ones over the years. :-)
    
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Ben Segall <bsegall@google.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: linux-kernel@vger.kernel.org
    Ingo Molnar committed Mar 21, 2021

Commits on Mar 17, 2021

  1. rseq, ptrace: Add PTRACE_GET_RSEQ_CONFIGURATION request

    For userspace checkpoint and restore (C/R), a way of getting the process
    state containing the RSEQ configuration is needed.
    
    There are two ways this information is going to be used:
     - to re-enable RSEQ for threads which had it enabled before C/R
     - to detect if a thread was in a critical section during C/R
    
    Since C/R preserves TLS memory and addresses, the RSEQ ABI will be
    restored using the address registered before C/R.
    
    Detecting whether the thread is in a critical section during C/R is
    needed to enforce the RSEQ abort behavior during C/R. Attaching with
    ptrace() before registers are dumped does not itself cause an RSEQ abort.
    Restoring the instruction pointer within the critical section is
    problematic because rseq_cs may get cleared before the control is passed
    to the migrated application code leading to RSEQ invariants not being
    preserved. C/R code will use RSEQ ABI address to find the abort handler
    to which the instruction pointer needs to be set.
    
    To achieve the above goals, expose the RSEQ ABI address and the
    signature value with the new ptrace request PTRACE_GET_RSEQ_CONFIGURATION.
    
    This new ptrace request can also be used by debuggers so they are aware
    of stops within restartable sequences in progress.
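For reference, a local mirror of the configuration record this request fills in, as described by the commit (check `<linux/ptrace.h>` for the authoritative definition before relying on this layout):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror of the record PTRACE_GET_RSEQ_CONFIGURATION fills in, per the
 * commit; the uapi header is the authoritative source. */
struct ptrace_rseq_configuration {
    uint64_t rseq_abi_pointer;  /* address registered via rseq(2)      */
    uint32_t rseq_abi_size;     /* size of the registered struct rseq  */
    uint32_t signature;         /* abort-handler signature             */
    uint32_t flags;
    uint32_t pad;
};

/* A C/R tool or debugger would use it roughly like this (sketch; the
 * tracee must already be ptrace-stopped):
 *
 *   struct ptrace_rseq_configuration conf;
 *   ptrace(PTRACE_GET_RSEQ_CONFIGURATION, pid, sizeof(conf), &conf);
 *   if (conf.rseq_abi_pointer) ... thread had rseq registered ...
 */
```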
    
    Signed-off-by: Piotr Figiel <figiel@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Reviewed-by: Michal Miroslaw <emmir@google.com>
    Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Acked-by: Oleg Nesterov <oleg@redhat.com>
    Link: https://lkml.kernel.org/r/20210226135156.1081606-1-figiel@google.com
    figiel authored and Thomas Gleixner committed Mar 17, 2021

Commits on Mar 10, 2021

  1. sched: Remove unnecessary variable from schedule_tail()

    Since 565790d (sched: Fix balance_callback(), 2020-05-11), there
    is no longer a need to reuse the result value of the call to finish_task_switch()
    inside schedule_tail(), therefore the variable used to hold that value
    (rq) is no longer needed.
    
    Signed-off-by: Edmundo Carmona Antoranz <eantoranz@gmail.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20210306210739.1370486-1-eantoranz@gmail.com
    eantoranz authored and Peter Zijlstra committed Mar 10, 2021
  2. sched: Optimize __calc_delta()

    A significant portion of __calc_delta() time is spent in the loop
    shifting a u64 by 32 bits. Use `fls` instead of iterating.
    
    This is ~7x faster on benchmarks.
    
    The generic `fls` implementation (`generic_fls`) is still ~4x faster
    than the loop. Architectures that have a better implementation will
    make use of it. For example, on x86 we get an additional factor of 2
    in speed without a dedicated implementation.
    
    On GCC, the asm versions of `fls` are about the same speed as the
    builtin. On Clang, the versions that use fls are more than twice as
    slow as the builtin. This is because the way the `fls` function is
    written, clang puts the value in memory:
    https://godbolt.org/z/EfMbYe. This bug is filed at
    https://bugs.llvm.org/show_bug.cgi?id=49406.
    
    ```
    name                                   cpu/op
    BM_Calc<__calc_delta_loop>             9.57ms ±12%
    BM_Calc<__calc_delta_generic_fls>      2.36ms ±13%
    BM_Calc<__calc_delta_asm_fls>          2.45ms ±13%
    BM_Calc<__calc_delta_asm_fls_nomem>    1.66ms ±12%
    BM_Calc<__calc_delta_asm_fls64>        2.46ms ±13%
    BM_Calc<__calc_delta_asm_fls64_nomem>  1.34ms ±15%
    BM_Calc<__calc_delta_builtin>          1.32ms ±11%
    ```
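The normalization step being optimized can be sketched as follows: the old code shifts the 64-bit factor down one bit at a time until it fits in 32 bits, while the fls-based version computes the needed shift in one step (a standalone model, not the kernel's __calc_delta()):

```c
#include <assert.h>
#include <stdint.h>

/* Old approach: shift the factor down bit by bit until the high 32 bits
 * are clear; up to 32 loop iterations. */
static unsigned int normalize_loop(uint64_t fact, unsigned int shift,
                                   uint64_t *out)
{
    while (fact >> 32) {
        fact >>= 1;
        shift--;
    }
    *out = fact;
    return shift;
}

/* New approach: find the highest set bit of the upper half with fls()
 * (modeled here via __builtin_clz) and shift once. */
static unsigned int normalize_fls(uint64_t fact, unsigned int shift,
                                  uint64_t *out)
{
    uint32_t hi = fact >> 32;

    if (hi) {
        int fs = 32 - __builtin_clz(hi);  /* fls(hi) */
        shift -= fs;
        fact >>= fs;
    }
    *out = fact;
    return shift;
}
```

Both produce identical results; the fls version just replaces a data-dependent loop with one bit-scan instruction on most architectures.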
    
    Signed-off-by: Clement Courbet <courbet@google.com>
    Signed-off-by: Josh Don <joshdon@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20210303224653.2579656-1-joshdon@google.com
    legrosbuffle authored and Peter Zijlstra committed Mar 10, 2021