Commits on Apr 13, 2021

  1. sched/fair: Reduce unnecessary check preempt in the sched tick

    If it has already been determined, early in the for_each_sched_entity
    walk, that the current CPU needs to be rescheduled, there is no need to
    check for preemption again in entity_tick() for each subsequent se->parent.
    
    Signed-off-by: jun qian <qianjun.kernel@gmail.com>
    jun qian authored and intel-lab-lkp committed Apr 13, 2021

Commits on Apr 9, 2021

  1. sched/fair: Introduce a CPU capacity comparison helper

    During load-balance, groups classified as group_misfit_task are filtered
    out if they do not pass
    
      group_smaller_max_cpu_capacity(<candidate group>, <local group>);
    
    which itself employs fits_capacity() to compare the sgc->max_capacity of
    both groups.
    
    Due to the underlying margin, fits_capacity(X, 1024) will return false for
    any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
    {261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
    CPUs, misfit migration will never intentionally upmigrate it to a CPU of
    higher capacity due to the aforementioned margin.
    
    One may argue the 20% margin of fits_capacity() is excessive in the advent
    of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
    is that fits_capacity() is meant to compare a utilization value to a
    capacity value, whereas here it is being used to compare two capacity
    values. As CPU capacity and task utilization have different dynamics, a
    sensible approach here would be to add a new helper dedicated to comparing
    CPU capacities.
    
    Also note that comparing capacity extrema of local and source sched_group's
    doesn't make much sense when at the end of the day the imbalance will be
    pulled by a known env->dst_cpu, whose capacity can be anywhere within the
    local group's capacity extrema.
    
    While at it, replace group_smaller_{min, max}_cpu_capacity() with
    comparisons of the source group's min/max capacity and the destination
    CPU's capacity.
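
The margin arithmetic described above can be sketched in a few lines. The `fits_capacity()` form below (a 1280/1024 ratio, roughly 20%) reproduces the "false for any X > 819" behaviour quoted in the changelog. The 1078/1024 ratio (roughly 5%) used for `capacity_greater()` is an assumption about the new helper; the changelog itself does not state the margin.

```c
#include <stdbool.h>

/* Utilization-vs-capacity check: cap must stay below ~80% of max. */
static bool fits_capacity(unsigned long cap, unsigned long max)
{
    return cap * 1280 < max * 1024;
}

/*
 * Capacity-vs-capacity check. The ~5% margin (1078/1024) is an
 * assumption made for this sketch, not taken from the changelog.
 */
static bool capacity_greater(unsigned long cap1, unsigned long cap2)
{
    return cap1 * 1024 > cap2 * 1078;
}
```

With the Pixel 4 values quoted above, `fits_capacity(871, 1024)` is false, so a "medium" CPU is never seen as smaller than a "big" one, while `capacity_greater(1024, 871)` is true under the assumed margin.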
    
    Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Reviewed-by: Qais Yousef <qais.yousef@arm.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Tested-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
    Link: https://lkml.kernel.org/r/20210407220628.3798191-4-valentin.schneider@arm.com
    valschneider authored and Peter Zijlstra committed Apr 9, 2021
  2. sched/fair: Clean up active balance nr_balance_failed trickery

    When triggering an active load balance, sd->nr_balance_failed is set to
    such a value that any further can_migrate_task() using said sd will ignore
    the output of task_hot().
    
    This behaviour makes sense, as active load balance intentionally preempts a
    rq's running task to migrate it right away, but this asynchronous write is
    a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
    before the sd->nr_balance_failed write either becomes visible to the
    stopper's CPU or even happens on the CPU that appended the stopper work.
    
    Add a struct lb_env flag to denote active balancing, and use it in
    can_migrate_task(). Remove the sd->nr_balance_failed write that served the
    same purpose. Clean up the LBF_DST_PINNED active balance special case.
    
    Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lkml.kernel.org/r/20210407220628.3798191-3-valentin.schneider@arm.com
    valschneider authored and Peter Zijlstra committed Apr 9, 2021
  3. sched/fair: Ignore percpu threads for imbalance pulls

    During load balance, LBF_SOME_PINNED will be set if any candidate task
    cannot be detached due to CPU affinity constraints. This can result in
    setting env->sd->parent->sgc->group_imbalance, which can lead to a group
    being classified as group_imbalanced (rather than any of the other, lower
    group_type) when balancing at a higher level.
    
    In workloads involving a single task per CPU, LBF_SOME_PINNED can often be
    set due to per-CPU kthreads being the only other runnable tasks on any
    given rq. This results in changing the group classification during
    load-balance at higher levels when in reality there is nothing that can be
    done for this affinity constraint: per-CPU kthreads, as the name implies,
    don't get to move around (modulo hotplug shenanigans).
    
    It's not as clear for userspace tasks - a task could be in an N-CPU cpuset
    with N-1 offline CPUs, making it an "accidental" per-CPU task rather than
    an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which
    we can leverage here to not set LBF_SOME_PINNED.
    
    Note that the aforementioned classification to group_imbalance (when
    nothing can be done) is especially problematic on big.LITTLE systems, which
    have a topology the likes of:
    
      DIE [          ]
      MC  [    ][    ]
           0  1  2  3
           L  L  B  B
    
      arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)
    
    Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC
    level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying
    the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if
    CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads
    can significantly delay the upmigration of said misfit tasks. Systems
    relying on ASYM_PACKING are likely to face similar issues.
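
The effect of the change can be sketched with a simplified migration check. Everything here is a hypothetical model: the function name, its parameters and the flag value are illustrative and are not the kernel's `can_migrate_task()`.

```c
#include <stdbool.h>

#define LBF_SOME_PINNED 0x08 /* illustrative flag value */

/*
 * Hypothetical model of one migration attempt. Returns the flags that
 * would be left set after trying to move a task to dst_cpu.
 * allowed_cpus is a bitmask of CPUs the task may run on.
 */
static unsigned int flags_after_attempt(bool is_percpu_kthread,
                                        unsigned int allowed_cpus,
                                        int dst_cpu)
{
    unsigned int flags = 0;

    /* Per-CPU kthreads can never move: do not report them as pinned. */
    if (is_percpu_kthread)
        return flags;

    /* Other tasks blocked by affinity still set LBF_SOME_PINNED. */
    if (!(allowed_cpus & (1u << dst_cpu)))
        flags |= LBF_SOME_PINNED;

    return flags;
}
```

In this model a per-CPU kthread bound to CPU 0 leaves the flags clear when CPU 1 tries to pull it, so the higher-level classification is not disturbed.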
    
    Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
    [Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
    [Reword changelog]
    Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lkml.kernel.org/r/20210407220628.3798191-2-valentin.schneider@arm.com
    Lingutla Chandrasekhar authored and Peter Zijlstra committed Apr 9, 2021
  4. sched/fair: Bring back select_idle_smt(), but differently

    Mel Gorman did some nice work in 9fe1f12 ("sched/fair: Merge
    select_idle_core/cpu()"), resulting in the kernel being more efficient
    at finding an idle CPU, and in tasks spending less time waiting to be
    run, both according to the schedstats run_delay numbers, and according
    to measured application latencies. Yay.
    
    The flip side of this is that we see more task migrations (about 30%
    more), higher cache misses, higher memory bandwidth utilization, and
    higher CPU use, for the same number of requests/second.
    
    This is most pronounced on a memcache type workload, which saw a
    consistent 1-3% increase in total CPU use on the system, due to those
    increased task migrations leading to higher L2 cache miss numbers, and
    higher memory utilization. The exclusive L3 cache on Skylake does us
    no favors there.
    
    On our web serving workload, that effect is usually negligible.
    
    It appears that the increased number of CPU migrations is generally a
    good thing, since it leads to lower cpu_delay numbers, reflecting the
    fact that tasks get to run faster. However, the reduced locality and
    the corresponding increase in L2 cache misses hurts a little.
    
    The patch below appears to fix the regression, while keeping the
    benefit of the lower cpu_delay numbers, by reintroducing
    select_idle_smt with a twist: when a socket has no idle cores, check
    to see if the sibling of "prev" is idle, before searching all the
    other CPUs.
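
The search order described above can be sketched as follows. This is a toy model, not the kernel code: the 8-CPU machine, the `cpu ^ 1` sibling mapping and the function names are all assumptions made for illustration.

```c
#include <stdbool.h>

#define NR_CPUS 8 /* hypothetical machine size */

/* Hypothetical topology: CPUs 2n and 2n+1 are SMT siblings. */
static int smt_sibling(int cpu)
{
    return cpu ^ 1;
}

/*
 * idle_mask has bit n set when CPU n is idle. When no core is fully
 * idle, try prev's sibling first, then fall back to scanning all CPUs.
 * Returns -1 when nothing is idle.
 */
static int pick_idle_cpu(unsigned int idle_mask, int prev, bool has_idle_core)
{
    if (!has_idle_core) {
        int sib = smt_sibling(prev);

        if (idle_mask & (1u << sib))
            return sib;
    }

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (idle_mask & (1u << cpu))
            return cpu;

    return -1;
}
```

Checking the sibling first keeps the task on the same core, which is what preserves L2 locality in the scenario the changelog describes.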
    
    This fixes both the occasional 9% regression on the web serving
    workload, and the continuous 2% CPU use regression on the memcache
    type workload.
    
    With Mel's patches and this patch together, task migrations are still
    high, but L2 cache misses, memory bandwidth, and CPU time used are
    back down to what they were before. The p95 and p99 response times for
    the memcache type application improve by about 10% over what they were
    before Mel's patches got merged.
    
    Signed-off-by: Rik van Riel <riel@surriel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lkml.kernel.org/r/20210326151932.2c187840@imladris.surriel.com
    rikvanriel authored and Peter Zijlstra committed Apr 9, 2021

Commits on Apr 8, 2021

  1. psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files

    Currently only root can write files under /proc/pressure. Relax this to
    allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be
    able to write to these files.
    
    Signed-off-by: Josh Hunt <johunt@akamai.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Link: https://lkml.kernel.org/r/20210402025833.27599-1-johunt@akamai.com
    geauxbears authored and Peter Zijlstra committed Apr 8, 2021

Commits on Mar 25, 2021

  1. sched/topology: Remove redundant cpumask_and() in init_overlap_sched_…

    …group()
    
    mask is built in build_balance_mask() by for_each_cpu(i, sg_span), so
    it must be a subset of sched_group_span(sg).
    
    So the cpumask_and() call is redundant - remove it.
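
The subset argument can be checked with plain 64-bit masks. This is a sketch with hypothetical helper names, standing in for the cpumask API: a mask built by visiting only the CPUs in a span is unchanged by a further AND with that span.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when every CPU set in 'mask' is also set in 'span'. */
static bool mask_is_subset(uint64_t mask, uint64_t span)
{
    return (mask & ~span) == 0;
}

/*
 * Build a mask by iterating only over CPUs in 'span' and keeping those
 * also present in 'keep' (a stand-in for the per-CPU selection test).
 */
static uint64_t build_mask_from_span(uint64_t span, uint64_t keep)
{
    uint64_t mask = 0;

    for (int cpu = 0; cpu < 64; cpu++)
        if (((span >> cpu) & 1) && ((keep >> cpu) & 1))
            mask |= UINT64_C(1) << cpu;

    return mask;
}
```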
    
    [ mingo: Adjusted the changelog a bit. ]
    
    Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
    Link: https://lore.kernel.org/r/20210325023140.23456-1-song.bao.hua@hisilicon.com
    Barry Song authored and Ingo Molnar committed Mar 25, 2021
  2. sched/core: Use -EINVAL in sched_dynamic_mode()

    -1 is -EPERM which is a somewhat odd error to return from
    sched_dynamic_write(). No other callers care about which negative
    value is used.
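
A minimal illustration of why a bare -1 is misleading: on Linux `EPERM` is 1, so a caller that propagates the value as an errno sees "Operation not permitted". The two lookup functions below are hypothetical stand-ins, not the kernel's `sched_dynamic_mode()`.

```c
#include <errno.h>

/* Old style: a bare -1 is indistinguishable from -EPERM. */
static int mode_lookup_old(int found)
{
    return found ? 0 : -1;
}

/* New style: an unrecognised mode string is reported as -EINVAL. */
static int mode_lookup_new(int found)
{
    return found ? 0 : -EINVAL;
}
```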
    
    Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Link: https://lore.kernel.org/r/20210325004515.531631-2-linux@rasmusvillemoes.dk
    Villemoes authored and Ingo Molnar committed Mar 25, 2021
  3. sched/core: Stop using magic values in sched_dynamic_mode()

    Use the enum names which are also what is used in the switch() in
    sched_dynamic_update().
    
    Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Link: https://lore.kernel.org/r/20210325004515.531631-1-linux@rasmusvillemoes.dk
    Villemoes authored and Ingo Molnar committed Mar 25, 2021

Commits on Mar 23, 2021

  1. sched/fair: Reduce long-tail newly idle balance cost

    A long-tail load balance cost is observed on the newly idle path,
    this is caused by a race window between the first nr_running check
    of the busiest runqueue and its nr_running recheck in detach_tasks.
    
    Before the busiest runqueue is locked, the tasks on the busiest
    runqueue could be pulled by other CPUs, and nr_running of the busiest
    runqueue becomes 1, or even 0 if the running task becomes idle. This
    causes detach_tasks to break with the LBF_ALL_PINNED flag set, and
    triggers a load_balance redo at the same sched_domain level.
    
    In order to find the new busiest sched_group and CPU, load balance will
    recompute and update the various load statistics, which eventually leads
    to the long-tail load balance cost.
    
    This patch clears LBF_ALL_PINNED flag for this race condition, and hence
    reduces the long-tail cost of newly idle balance.
    
    Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lkml.kernel.org/r/1614154549-116078-1-git-send-email-aubrey.li@intel.com
    aubreyli authored and Peter Zijlstra committed Mar 23, 2021
  2. sched/fair: Optimize test_idle_cores() for !SMT

    update_idle_core() is only done for the case of sched_smt_present,
    but test_idle_cores() is done for all machines, even those without
    SMT.
    
    This can contribute to up to 8%+ hackbench performance loss on a
    machine like Kunpeng 920, which has no SMT. This patch removes the
    redundant test_idle_cores() for !SMT machines.
    
    Hackbench is run with -g {2..14}; for each g it is run 10 times to get
    an average.
    
      $ numactl -N 0 hackbench -p -T -l 20000 -g $1
    
    The below is the result of hackbench w/ and w/o this patch:
    
      g=     2      4      6      8      10     12      14
      w/o: 1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929
      w/ : 1.8428 3.7436 5.4501 6.9522 8.2882  9.9535 11.3367
                                 +4.1%  +8.3%   +7.3%   +6.3%
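
The percentage row can be reproduced from the two timing rows, assuming it is the relative reduction in hackbench time (lower is better). The helper name is hypothetical.

```c
/* Relative improvement in percent; positive means the patched run was faster. */
static double improvement_pct(double without_patch, double with_patch)
{
    return (without_patch - with_patch) / without_patch * 100.0;
}
```

For g=8 this gives (7.2491 - 6.9522) / 7.2491, about 4.1%, and for g=10 about 8.3%, matching the figures in the table.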
    
    Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Link: https://lkml.kernel.org/r/20210320221432.924-1-song.bao.hua@hisilicon.com
    Barry Song authored and Peter Zijlstra committed Mar 23, 2021
  3. psi: Reduce calls to sched_clock() in psi

    We noticed that the cost of psi increases with the increase in the
    levels of the cgroups. Particularly the cost of cpu_clock() sticks out
    as the kernel calls it multiple times as it traverses up the cgroup
    tree. This patch reduces the calls to cpu_clock().
    
    Performed perf bench on Intel Broadwell with 3 levels of cgroup.
    
    Before the patch:
    
    $ perf bench sched all
     # Running sched/messaging benchmark...
     # 20 sender and receiver processes per group
     # 10 groups == 400 processes run
    
         Total time: 0.747 [sec]
    
     # Running sched/pipe benchmark...
     # Executed 1000000 pipe operations between two processes
    
         Total time: 3.516 [sec]
    
           3.516689 usecs/op
             284358 ops/sec
    
    After the patch:
    
    $ perf bench sched all
     # Running sched/messaging benchmark...
     # 20 sender and receiver processes per group
     # 10 groups == 400 processes run
    
         Total time: 0.640 [sec]
    
     # Running sched/pipe benchmark...
     # Executed 1000000 pipe operations between two processes
    
         Total time: 3.329 [sec]
    
           3.329820 usecs/op
             300316 ops/sec
    
    Signed-off-by: Shakeel Butt <shakeelb@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Link: https://lkml.kernel.org/r/20210321205156.4186483-1-shakeelb@google.com
    shakeelb authored and Peter Zijlstra committed Mar 23, 2021
  4. stop_machine: Add caller debug info to queue_stop_cpus_work

    Most callsites were covered by commit
    
      a8b62fd ("stop_machine: Add function and caller debug info")
    
    but this skipped queue_stop_cpus_work(). Add caller debug info to it.
    
    Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20201210163830.21514-2-valentin.schneider@arm.com
    valschneider authored and Peter Zijlstra committed Mar 23, 2021

Commits on Mar 21, 2021

  1. sched: Fix various typos

    Fix ~42 single-word typos in scheduler code comments.
    
    We have accumulated a few fun ones over the years. :-)
    
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Ben Segall <bsegall@google.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: linux-kernel@vger.kernel.org
    Ingo Molnar committed Mar 21, 2021

Commits on Mar 17, 2021

  1. rseq, ptrace: Add PTRACE_GET_RSEQ_CONFIGURATION request

    For userspace checkpoint and restore (C/R) a way of getting process state
    containing RSEQ configuration is needed.
    
    There are two ways this information is going to be used:
     - to re-enable RSEQ for threads which had it enabled before C/R
     - to detect if a thread was in a critical section during C/R
    
    Since C/R preserves TLS memory and addresses RSEQ ABI will be restored
    using the address registered before C/R.
    
    Detection whether the thread is in a critical section during C/R is needed
    to enforce behavior of RSEQ abort during C/R. Attaching with ptrace()
    before registers are dumped itself doesn't cause RSEQ abort.
    Restoring the instruction pointer within the critical section is
    problematic because rseq_cs may get cleared before the control is passed
    to the migrated application code leading to RSEQ invariants not being
    preserved. C/R code will use RSEQ ABI address to find the abort handler
    to which the instruction pointer needs to be set.
    
    To achieve above goals expose the RSEQ ABI address and the signature value
    with the new ptrace request PTRACE_GET_RSEQ_CONFIGURATION.
    
    This new ptrace request can also be used by debuggers so they are aware
    of stops within restartable sequences in progress.
    
    Signed-off-by: Piotr Figiel <figiel@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Reviewed-by: Michal Miroslaw <emmir@google.com>
    Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Acked-by: Oleg Nesterov <oleg@redhat.com>
    Link: https://lkml.kernel.org/r/20210226135156.1081606-1-figiel@google.com
    figiel authored and Thomas Gleixner committed Mar 17, 2021

Commits on Mar 10, 2021

  1. sched: Remove unnecessary variable from schedule_tail()

    Since 565790d (sched: Fix balance_callback(), 2020-05-11), there
    is no longer a need to reuse the result value of the call to finish_task_switch()
    inside schedule_tail(), therefore the variable used to hold that value
    (rq) is no longer needed.
    
    Signed-off-by: Edmundo Carmona Antoranz <eantoranz@gmail.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20210306210739.1370486-1-eantoranz@gmail.com
    eantoranz authored and Peter Zijlstra committed Mar 10, 2021
  2. sched: Optimize __calc_delta()

    A significant portion of __calc_delta() time is spent in the loop
    shifting a u64 by 32 bits. Use `fls` instead of iterating.
    
    This is ~7x faster on benchmarks.
    
    The generic `fls` implementation (`generic_fls`) is still ~4x faster
    than the loop.
    Architectures that have a better implementation will make use of it. For
    example, on x86 we get an additional factor 2 in speed without dedicated
    implementation.
    
    On GCC, the asm versions of `fls` are about the same speed as the
    builtin. On Clang, the versions that use fls are more than twice as
    slow as the builtin. This is because the way the `fls` function is
    written, clang puts the value in memory:
    https://godbolt.org/z/EfMbYe. This bug is filed at
    https://bugs.llvm.org/show_bug.cgi?id=49406.
    
    ```
    name                                   cpu/op
    BM_Calc<__calc_delta_loop>             9.57ms ±12%
    BM_Calc<__calc_delta_generic_fls>      2.36ms ±13%
    BM_Calc<__calc_delta_asm_fls>          2.45ms ±13%
    BM_Calc<__calc_delta_asm_fls_nomem>    1.66ms ±12%
    BM_Calc<__calc_delta_asm_fls64>        2.46ms ±13%
    BM_Calc<__calc_delta_asm_fls64_nomem>  1.34ms ±15%
    BM_Calc<__calc_delta_builtin>          1.32ms ±11%
    ```
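
The loop-versus-`fls` idea can be sketched in isolation. This is not the kernel's `__calc_delta()`: the function names are hypothetical, and `fls32()` is built on the GCC/Clang `__builtin_clz` builtin as an assumption about the toolchain. Both functions return how many right shifts are needed for a 64-bit factor to fit in 32 bits.

```c
#include <stdint.h>

/* Find last set bit, 1-based; returns 0 for x == 0. */
static int fls32(uint32_t x)
{
    return x ? 32 - __builtin_clz(x) : 0;
}

/* Loop variant: shift one bit at a time until fact fits in 32 bits. */
static int shifts_needed_loop(uint64_t fact)
{
    int n = 0;

    while (fact >> 32) {
        fact >>= 1;
        n++;
    }
    return n;
}

/* fls variant: the shift count is the position of the top bit of the high word. */
static int shifts_needed_fls(uint64_t fact)
{
    return fls32((uint32_t)(fact >> 32));
}
```

The loop can iterate up to 32 times, whereas the `fls` variant is a single bit-scan, which is consistent with the speedups reported in the table above.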
    
    Signed-off-by: Clement Courbet <courbet@google.com>
    Signed-off-by: Josh Don <joshdon@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20210303224653.2579656-1-joshdon@google.com
    legrosbuffle authored and Peter Zijlstra committed Mar 10, 2021

Commits on Mar 6, 2021

  1. psi: Optimize task switch inside shared cgroups

    Commit 36b238d ("psi: Optimize switching tasks inside shared
    cgroups") updates only the cgroups whose state actually changes during
    a task switch, but only in the task preempt case, not in the task sleep
    case.
    
    We don't actually need to clear and set the TSK_ONCPU state for the
    common cgroups of the next and prev tasks in the sleep case either. This
    can save many psi_group_change() calls, especially when most activity
    comes from one leaf cgroup.
    
    sleep before:
    psi_dequeue()
      while ((group = iterate_groups(prev)))  # all ancestors
        psi_group_change(prev, .clear=TSK_RUNNING|TSK_ONCPU)
    psi_task_switch()
      while ((group = iterate_groups(next)))  # all ancestors
        psi_group_change(next, .set=TSK_ONCPU)
    
    sleep after:
    psi_dequeue()
      nop
    psi_task_switch()
      while ((group = iterate_groups(next)))  # until (prev & next)
        psi_group_change(next, .set=TSK_ONCPU)
      while ((group = iterate_groups(prev)))  # all ancestors
        psi_group_change(prev, .clear=common?TSK_RUNNING:TSK_RUNNING|TSK_ONCPU)
    
    When a voluntary sleep switches to another task, we remove one call of
    psi_group_change() for every common cgroup ancestor of the two tasks.
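
The saving can be counted with a simple model derived from the before/after pseudocode above. Depths are the number of cgroup ancestors of each task and `common` is the number they share; the function names are hypothetical.

```c
/* Before: prev walks all its ancestors, then next walks all of its own. */
static int calls_before(int prev_depth, int next_depth)
{
    return prev_depth + next_depth;
}

/* After: next's walk stops at the first ancestor shared with prev. */
static int calls_after(int prev_depth, int next_depth, int common)
{
    return (next_depth - common) + prev_depth;
}
```

For two tasks that are each four levels deep and share three ancestors, the model gives 8 calls before and 5 after, one fewer per common ancestor.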
    
    Co-developed-by: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Link: https://lkml.kernel.org/r/20210303034659.91735-5-zhouchengming@bytedance.com
    chengmingzhou authored and Ingo Molnar committed Mar 6, 2021
  2. psi: Pressure states are unlikely

    Move the unlikely branches out of line. This eliminates undesirable
    jumps during wakeup and sleeps for workloads that aren't under any
    sort of resource pressure.
    
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lkml.kernel.org/r/20210303034659.91735-4-zhouchengming@bytedance.com
    hnaz authored and Ingo Molnar committed Mar 6, 2021
  3. psi: Use ONCPU state tracking machinery to detect reclaim

    Move the reclaim detection from the timer tick to the task state
    tracking machinery using the recently added ONCPU state. Also check
    for changes in the task's psi_flags in the psi_task_switch()
    optimization so that the parents are updated properly.
    
    In terms of performance and cost, this ONCPU task state tracking
    is not cheaper than the previous timer tick in aggregate. But the code
    is simpler and shorter this way, so it's a maintainability win.
    Johannes did some testing with perf bench, and the performance and cost
    changes would be acceptable for real workloads.
    
    Thanks to Johannes Weiner for pointing out the psi_task_switch()
    optimization things and the clearer changelog.
    
    Co-developed-by: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Link: https://lkml.kernel.org/r/20210303034659.91735-3-zhouchengming@bytedance.com
    chengmingzhou authored and Ingo Molnar committed Mar 6, 2021
  4. psi: Add PSI_CPU_FULL state

    The FULL state doesn't exist for the CPU resource at the system level,
    but it does exist at the cgroup level. There it means that all non-idle
    tasks in a cgroup are delayed on the CPU resource, which is either used
    by others outside of the cgroup or throttled by the cgroup cpu.max
    configuration.
    
    Co-developed-by: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Link: https://lkml.kernel.org/r/20210303034659.91735-2-zhouchengming@bytedance.com
    chengmingzhou authored and Ingo Molnar committed Mar 6, 2021
  5. sched/topology: fix the issue groups don't span domain->span for NUMA…

    … diameter > 2
    
    As long as NUMA diameter > 2, building sched_domain by sibling's child
    domain will definitely create a sched_domain with sched_group which will
    span out of the sched_domain:
    
                   +------+         +------+        +-------+       +------+
                   | node |  12     |node  | 20     | node  |  12   |node  |
                   |  0   +---------+1     +--------+ 2     +-------+3     |
                   +------+         +------+        +-------+       +------+
    
    domain0        node0            node1            node2          node3
    
    domain1        node0+1          node0+1          node2+3        node2+3
                                                     +
    domain2        node0+1+2                         |
                 group: node0+1                      |
                   group:node2+3 <-------------------+
    
    when node2 is added into the domain2 of node0, kernel is using the child
    domain of node2's domain2, which is domain1(node2+3). Node 3 is outside
    the span of the domain including node0+1+2.
    
    This will make load_balance() run based on screwed avg_load and group_type
    in the sched_group spanning out of the sched_domain, and it also makes
    select_task_rq_fair() pick an idle CPU outside the sched_domain.
    
    Real servers which suffer from this problem include Kunpeng920 and 8-node
    Sun Fire X4600-M2, at least.
    
    Here we move to use the *child* domain of the *child* domain of node2's
    domain2 as the newly added sched_group. At the same time, we re-use the
    lower level sgc directly.
                   +------+         +------+        +-------+       +------+
                   | node |  12     |node  | 20     | node  |  12   |node  |
                   |  0   +---------+1     +--------+ 2     +-------+3     |
                   +------+         +------+        +-------+       +------+
    
    domain0        node0            node1          +- node2          node3
                                                   |
    domain1        node0+1          node0+1        | node2+3        node2+3
                                                   |
    domain2        node0+1+2                       |
                 group: node0+1                    |
                   group:node2 <-------------------+
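
The span problem in the two diagrams can be checked with one bit per NUMA node. This is a sketch with hypothetical names, not the topology code: a group is acceptable for a domain only if it has no node outside the domain's span.

```c
#include <stdbool.h>

/* One bit per NUMA node: bit n set means node n is in the span. */
#define NODE(n) (1u << (n))

/* A sched_group must stay inside the span of the domain that owns it. */
static bool group_within_domain(unsigned int group_span, unsigned int domain_span)
{
    return (group_span & ~domain_span) == 0;
}
```

For node0's domain2 (node0+1+2), the child-based group node2+3 fails this check because of node3, while the grandchild-based group node2 passes.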
    
    While the lower level sgc is re-used, this patch only changes the remote
    sched_groups for those sched_domains playing the grandchild trick. Therefore,
    sgc->next_update is still safe, since it's only touched by CPUs that have
    the group span as their local group. And sgc->imbalance is also safe because
    sd_parent remains the same in load_balance and LB only tries other CPUs
    from the local group.
    
    Moreover, since local groups are not touched, they still get roughly
    equal size in a TL. And should_we_balance() only matters for local
    groups, so the pull probability of those groups is still roughly
    equal.
    
    Tested by the below topology:
    qemu-system-aarch64  -M virt -nographic \
     -smp cpus=8 \
     -numa node,cpus=0-1,nodeid=0 \
     -numa node,cpus=2-3,nodeid=1 \
     -numa node,cpus=4-5,nodeid=2 \
     -numa node,cpus=6-7,nodeid=3 \
     -numa dist,src=0,dst=1,val=12 \
     -numa dist,src=0,dst=2,val=20 \
     -numa dist,src=0,dst=3,val=22 \
     -numa dist,src=1,dst=2,val=22 \
     -numa dist,src=2,dst=3,val=12 \
     -numa dist,src=1,dst=3,val=24 \
     -m 4G -cpu cortex-a57 -kernel arch/arm64/boot/Image
    
    w/o patch, we get lots of "groups don't span domain->span":
    [    0.802139] CPU0 attaching sched-domain(s):
    [    0.802193]  domain-0: span=0-1 level=MC
    [    0.802443]   groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=979 }
    [    0.802693]   domain-1: span=0-3 level=NUMA
    [    0.802731]    groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
    [    0.802811]    domain-2: span=0-5 level=NUMA
    [    0.802829]     groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
    [    0.802881] ERROR: groups don't span domain->span
    [    0.803058]     domain-3: span=0-7 level=NUMA
    [    0.803080]      groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
    [    0.804055] CPU1 attaching sched-domain(s):
    [    0.804072]  domain-0: span=0-1 level=MC
    [    0.804096]   groups: 1:{ span=1 cap=979 }, 0:{ span=0 cap=1013 }
    [    0.804152]   domain-1: span=0-3 level=NUMA
    [    0.804170]    groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
    [    0.804219]    domain-2: span=0-5 level=NUMA
    [    0.804236]     groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
    [    0.804302] ERROR: groups don't span domain->span
    [    0.804520]     domain-3: span=0-7 level=NUMA
    [    0.804546]      groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
    [    0.804677] CPU2 attaching sched-domain(s):
    [    0.804687]  domain-0: span=2-3 level=MC
    [    0.804705]   groups: 2:{ span=2 cap=934 }, 3:{ span=3 cap=1009 }
    [    0.804754]   domain-1: span=0-3 level=NUMA
    [    0.804772]    groups: 2:{ span=2-3 cap=1943 }, 0:{ span=0-1 cap=1992 }
    [    0.804820]    domain-2: span=0-5 level=NUMA
    [    0.804836]     groups: 2:{ span=0-3 mask=2-3 cap=3991 }, 4:{ span=0-1,4-7 mask=4-5 cap=5985 }
    [    0.804944] ERROR: groups don't span domain->span
    [    0.805108]     domain-3: span=0-7 level=NUMA
    [    0.805134]      groups: 2:{ span=0-5 mask=2-3 cap=5899 }, 6:{ span=0-1,4-7 mask=6-7 cap=6125 }
    [    0.805223] CPU3 attaching sched-domain(s):
    [    0.805232]  domain-0: span=2-3 level=MC
    [    0.805249]   groups: 3:{ span=3 cap=1009 }, 2:{ span=2 cap=934 }
    [    0.805319]   domain-1: span=0-3 level=NUMA
    [    0.805336]    groups: 2:{ span=2-3 cap=1943 }, 0:{ span=0-1 cap=1992 }
    [    0.805383]    domain-2: span=0-5 level=NUMA
    [    0.805399]     groups: 2:{ span=0-3 mask=2-3 cap=3991 }, 4:{ span=0-1,4-7 mask=4-5 cap=5985 }
    [    0.805458] ERROR: groups don't span domain->span
    [    0.805605]     domain-3: span=0-7 level=NUMA
    [    0.805626]      groups: 2:{ span=0-5 mask=2-3 cap=5899 }, 6:{ span=0-1,4-7 mask=6-7 cap=6125 }
    [    0.805712] CPU4 attaching sched-domain(s):
    [    0.805721]  domain-0: span=4-5 level=MC
    [    0.805738]   groups: 4:{ span=4 cap=984 }, 5:{ span=5 cap=924 }
    [    0.805787]   domain-1: span=4-7 level=NUMA
    [    0.805803]    groups: 4:{ span=4-5 cap=1908 }, 6:{ span=6-7 cap=2029 }
    [    0.805851]    domain-2: span=0-1,4-7 level=NUMA
    [    0.805867]     groups: 4:{ span=4-7 cap=3937 }, 0:{ span=0-3 cap=3935 }
    [    0.805915] ERROR: groups don't span domain->span
    [    0.806108]     domain-3: span=0-7 level=NUMA
    [    0.806130]      groups: 4:{ span=0-1,4-7 mask=4-5 cap=5985 }, 2:{ span=0-3 mask=2-3 cap=3991 }
    [    0.806214] CPU5 attaching sched-domain(s):
    [    0.806222]  domain-0: span=4-5 level=MC
    [    0.806240]   groups: 5:{ span=5 cap=924 }, 4:{ span=4 cap=984 }
    [    0.806841]   domain-1: span=4-7 level=NUMA
    [    0.806866]    groups: 4:{ span=4-5 cap=1908 }, 6:{ span=6-7 cap=2029 }
    [    0.806934]    domain-2: span=0-1,4-7 level=NUMA
    [    0.806953]     groups: 4:{ span=4-7 cap=3937 }, 0:{ span=0-3 cap=3935 }
    [    0.807004] ERROR: groups don't span domain->span
    [    0.807312]     domain-3: span=0-7 level=NUMA
    [    0.807386]      groups: 4:{ span=0-1,4-7 mask=4-5 cap=5985 }, 2:{ span=0-3 mask=2-3 cap=3991 }
    [    0.807686] CPU6 attaching sched-domain(s):
    [    0.807710]  domain-0: span=6-7 level=MC
    [    0.807750]   groups: 6:{ span=6 cap=1017 }, 7:{ span=7 cap=1012 }
    [    0.807840]   domain-1: span=4-7 level=NUMA
    [    0.807870]    groups: 6:{ span=6-7 cap=2029 }, 4:{ span=4-5 cap=1908 }
    [    0.807952]    domain-2: span=0-1,4-7 level=NUMA
    [    0.807985]     groups: 6:{ span=4-7 mask=6-7 cap=4077 }, 0:{ span=0-5 mask=0-1 cap=5843 }
    [    0.808045] ERROR: groups don't span domain->span
    [    0.808257]     domain-3: span=0-7 level=NUMA
    [    0.808571]      groups: 6:{ span=0-1,4-7 mask=6-7 cap=6125 }, 2:{ span=0-5 mask=2-3 cap=5899 }
    [    0.808848] CPU7 attaching sched-domain(s):
    [    0.808860]  domain-0: span=6-7 level=MC
    [    0.808880]   groups: 7:{ span=7 cap=1012 }, 6:{ span=6 cap=1017 }
    [    0.808953]   domain-1: span=4-7 level=NUMA
    [    0.808974]    groups: 6:{ span=6-7 cap=2029 }, 4:{ span=4-5 cap=1908 }
    [    0.809034]    domain-2: span=0-1,4-7 level=NUMA
    [    0.809055]     groups: 6:{ span=4-7 mask=6-7 cap=4077 }, 0:{ span=0-5 mask=0-1 cap=5843 }
    [    0.809128] ERROR: groups don't span domain->span
    [    0.810361]     domain-3: span=0-7 level=NUMA
    [    0.810400]      groups: 6:{ span=0-1,4-7 mask=6-7 cap=5961 }, 2:{ span=0-5 mask=2-3 cap=5903 }
    
    w/ patch, we don't get "groups don't span domain->span" any more:
    [    1.486271] CPU0 attaching sched-domain(s):
    [    1.486820]  domain-0: span=0-1 level=MC
    [    1.500924]   groups: 0:{ span=0 cap=980 }, 1:{ span=1 cap=994 }
    [    1.515717]   domain-1: span=0-3 level=NUMA
    [    1.515903]    groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3 cap=1989 }
    [    1.516989]    domain-2: span=0-5 level=NUMA
    [    1.517124]     groups: 0:{ span=0-3 cap=3963 }, 4:{ span=4-5 cap=1949 }
    [    1.517369]     domain-3: span=0-7 level=NUMA
    [    1.517423]      groups: 0:{ span=0-5 mask=0-1 cap=5912 }, 6:{ span=4-7 mask=6-7 cap=4054 }
    [    1.520027] CPU1 attaching sched-domain(s):
    [    1.520097]  domain-0: span=0-1 level=MC
    [    1.520184]   groups: 1:{ span=1 cap=994 }, 0:{ span=0 cap=980 }
    [    1.520429]   domain-1: span=0-3 level=NUMA
    [    1.520487]    groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3 cap=1989 }
    [    1.520687]    domain-2: span=0-5 level=NUMA
    [    1.520744]     groups: 0:{ span=0-3 cap=3963 }, 4:{ span=4-5 cap=1949 }
    [    1.520948]     domain-3: span=0-7 level=NUMA
    [    1.521038]      groups: 0:{ span=0-5 mask=0-1 cap=5912 }, 6:{ span=4-7 mask=6-7 cap=4054 }
    [    1.522068] CPU2 attaching sched-domain(s):
    [    1.522348]  domain-0: span=2-3 level=MC
    [    1.522606]   groups: 2:{ span=2 cap=1003 }, 3:{ span=3 cap=986 }
    [    1.522832]   domain-1: span=0-3 level=NUMA
    [    1.522885]    groups: 2:{ span=2-3 cap=1989 }, 0:{ span=0-1 cap=1974 }
    [    1.523043]    domain-2: span=0-5 level=NUMA
    [    1.523092]     groups: 2:{ span=0-3 mask=2-3 cap=4037 }, 4:{ span=4-5 cap=1949 }
    [    1.523302]     domain-3: span=0-7 level=NUMA
    [    1.523352]      groups: 2:{ span=0-5 mask=2-3 cap=5986 }, 6:{ span=0-1,4-7 mask=6-7 cap=6102 }
    [    1.523748] CPU3 attaching sched-domain(s):
    [    1.523774]  domain-0: span=2-3 level=MC
    [    1.523825]   groups: 3:{ span=3 cap=986 }, 2:{ span=2 cap=1003 }
    [    1.524009]   domain-1: span=0-3 level=NUMA
    [    1.524086]    groups: 2:{ span=2-3 cap=1989 }, 0:{ span=0-1 cap=1974 }
    [    1.524281]    domain-2: span=0-5 level=NUMA
    [    1.524331]     groups: 2:{ span=0-3 mask=2-3 cap=4037 }, 4:{ span=4-5 cap=1949 }
    [    1.524534]     domain-3: span=0-7 level=NUMA
    [    1.524586]      groups: 2:{ span=0-5 mask=2-3 cap=5986 }, 6:{ span=0-1,4-7 mask=6-7 cap=6102 }
    [    1.524847] CPU4 attaching sched-domain(s):
    [    1.524873]  domain-0: span=4-5 level=MC
    [    1.524954]   groups: 4:{ span=4 cap=958 }, 5:{ span=5 cap=991 }
    [    1.525105]   domain-1: span=4-7 level=NUMA
    [    1.525153]    groups: 4:{ span=4-5 cap=1949 }, 6:{ span=6-7 cap=2006 }
    [    1.525368]    domain-2: span=0-1,4-7 level=NUMA
    [    1.525428]     groups: 4:{ span=4-7 cap=3955 }, 0:{ span=0-1 cap=1974 }
    [    1.532726]     domain-3: span=0-7 level=NUMA
    [    1.532811]      groups: 4:{ span=0-1,4-7 mask=4-5 cap=6003 }, 2:{ span=0-3 mask=2-3 cap=4037 }
    [    1.534125] CPU5 attaching sched-domain(s):
    [    1.534159]  domain-0: span=4-5 level=MC
    [    1.534303]   groups: 5:{ span=5 cap=991 }, 4:{ span=4 cap=958 }
    [    1.534490]   domain-1: span=4-7 level=NUMA
    [    1.534572]    groups: 4:{ span=4-5 cap=1949 }, 6:{ span=6-7 cap=2006 }
    [    1.534734]    domain-2: span=0-1,4-7 level=NUMA
    [    1.534783]     groups: 4:{ span=4-7 cap=3955 }, 0:{ span=0-1 cap=1974 }
    [    1.536057]     domain-3: span=0-7 level=NUMA
    [    1.536430]      groups: 4:{ span=0-1,4-7 mask=4-5 cap=6003 }, 2:{ span=0-3 mask=2-3 cap=3896 }
    [    1.536815] CPU6 attaching sched-domain(s):
    [    1.536846]  domain-0: span=6-7 level=MC
    [    1.536934]   groups: 6:{ span=6 cap=1005 }, 7:{ span=7 cap=1001 }
    [    1.537144]   domain-1: span=4-7 level=NUMA
    [    1.537262]    groups: 6:{ span=6-7 cap=2006 }, 4:{ span=4-5 cap=1949 }
    [    1.537553]    domain-2: span=0-1,4-7 level=NUMA
    [    1.537613]     groups: 6:{ span=4-7 mask=6-7 cap=4054 }, 0:{ span=0-1 cap=1805 }
    [    1.537872]     domain-3: span=0-7 level=NUMA
    [    1.537998]      groups: 6:{ span=0-1,4-7 mask=6-7 cap=6102 }, 2:{ span=0-5 mask=2-3 cap=5845 }
    [    1.538448] CPU7 attaching sched-domain(s):
    [    1.538505]  domain-0: span=6-7 level=MC
    [    1.538586]   groups: 7:{ span=7 cap=1001 }, 6:{ span=6 cap=1005 }
    [    1.538746]   domain-1: span=4-7 level=NUMA
    [    1.538798]    groups: 6:{ span=6-7 cap=2006 }, 4:{ span=4-5 cap=1949 }
    [    1.539048]    domain-2: span=0-1,4-7 level=NUMA
    [    1.539111]     groups: 6:{ span=4-7 mask=6-7 cap=4054 }, 0:{ span=0-1 cap=1805 }
    [    1.539571]     domain-3: span=0-7 level=NUMA
    [    1.539610]      groups: 6:{ span=0-1,4-7 mask=6-7 cap=6102 }, 2:{ span=0-5 mask=2-3 cap=5845 }
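    
    The condition behind the "groups don't span domain->span" error can be
    illustrated in user space. The sketch below is not the kernel's debug code:
    cpu_range() and groups_span_domain() are hypothetical helpers that model CPU
    spans as 64-bit masks and compare the union of the group spans with the
    domain span, using the CPU2 domain-2 values taken from the two logs above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: mask covering CPUs lo..hi inclusive (CPU ids < 64). */
static uint64_t cpu_range(unsigned int lo, unsigned int hi)
{
    assert(lo <= hi && hi < 64);

    uint64_t upper = (hi == 63) ? UINT64_MAX : ((UINT64_C(1) << (hi + 1)) - 1);
    uint64_t lower = (UINT64_C(1) << lo) - 1;

    return upper & ~lower;
}

/* Hypothetical check: the group spans together must equal the domain span. */
static bool groups_span_domain(uint64_t domain_span,
                               const uint64_t *group_spans, size_t nr_groups)
{
    uint64_t covered = 0;

    for (size_t i = 0; i < nr_groups; i++)
        covered |= group_spans[i];

    return covered == domain_span;
}
```

    With the pre-patch groups (span=0-3 and span=0-1,4-7) the union is 0-7,
    which does not match the domain span 0-5; with the patched groups (span=0-3
    and span=4-5) the union is exactly 0-5.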
    
    Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
    Tested-by: Meelis Roos <mroos@linux.ee>
    Link: https://lkml.kernel.org/r/20210224030944.15232-1-song.bao.hua@hisilicon.com
    Barry Song authored and Ingo Molnar committed Mar 6, 2021
  6. cpu/hotplug: Add cpuhp_invoke_callback_range()

    Factorize and unify cpuhp callback range invocations, especially for
    the hotunplug path, where two different ways of decrementing were used. The
    first one decrements before the callback is called:
    
     cpuhp_thread_fun()
         state = st->state;
         st->state--;
         cpuhp_invoke_callback(state);
    
    The second one, after:
    
     take_down_cpu()|cpuhp_down_callbacks()
         cpuhp_invoke_callback(st->state);
         st->state--;
    
    This is problematic for rolling back the steps in case of error, as
    depending on the decrement, the rollback will start from N or N-1. It also
    makes tracing inconsistent, between steps run in the cpuhp thread and
    the others.
    
    Additionally, avoid useless cpuhp_thread_fun() loops by skipping empty
    steps.
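    
    A unified range helper of this kind can be sketched in user space. The code
    below is a simplified simulation under stated assumptions, not the kernel
    implementation: struct hp_sim, hp_next_state(), hp_invoke_range() and the
    step_ok()/step_fail() callbacks are hypothetical names. It updates the state
    before each callback in both directions and skips empty (NULL) steps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_STEPS 6

struct hp_sim;
typedef int (*step_cb)(struct hp_sim *sim, int state);

/* Simulated per-CPU hotplug state; a NULL callback models an empty step. */
struct hp_sim {
    int state;              /* current state index, 0 .. NR_STEPS - 1 */
    int calls;              /* number of callbacks that completed */
    step_cb up[NR_STEPS];
    step_cb down[NR_STEPS];
};

/* Example callbacks: one that succeeds and one that reports an error. */
static int step_ok(struct hp_sim *sim, int state)
{
    (void)state;
    sim->calls++;
    return 0;
}

static int step_fail(struct hp_sim *sim, int state)
{
    (void)sim;
    (void)state;
    return -1;
}

/*
 * Pick the next non-empty step and update sim->state before the callback
 * runs, in both directions. Returns false once target has been reached.
 */
static bool hp_next_state(struct hp_sim *sim, bool bringup, int target,
                          int *to_run)
{
    for (;;) {
        if (bringup) {
            if (sim->state >= target)
                return false;
            *to_run = ++sim->state;
        } else {
            if (sim->state <= target)
                return false;
            *to_run = sim->state--;
        }
        if ((bringup ? sim->up[*to_run] : sim->down[*to_run]) != NULL)
            return true;
    }
}

/* Run every non-empty step between the current state and target. */
static int hp_invoke_range(struct hp_sim *sim, bool bringup, int target)
{
    int state, err = 0;

    assert(target >= 0 && target < NR_STEPS);
    assert(sim->state >= 0 && sim->state < NR_STEPS);

    while (hp_next_state(sim, bringup, target, &state)) {
        err = (bringup ? sim->up[state] : sim->down[state])(sim, state);
        if (err)
            break;
    }
    return err;
}
```

    Because the state is always updated before the callback, a failing step
    leaves the simulated CPU at one well-defined point regardless of which path
    invoked the range, which is the property the rollback relies on.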
    
    Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lkml.kernel.org/r/20210216103506.416286-4-vincent.donnefort@arm.com
    Vincent Donnefort authored and Ingo Molnar committed Mar 6, 2021
  7. cpu/hotplug: CPUHP_BRINGUP_CPU failure exception

    The atomic states (between CPUHP_AP_IDLE_DEAD and CPUHP_AP_ONLINE) are
    triggered by the CPUHP_BRINGUP_CPU step. If the latter fails, no atomic
    state can be rolled back.
    
    DEAD callbacks too can't fail and disallow recovery. As a consequence,
    during hotunplug, the fail injection interface should prohibit all states
    from CPUHP_BRINGUP_CPU to CPUHP_ONLINE.
    
    Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lkml.kernel.org/r/20210216103506.416286-3-vincent.donnefort@arm.com
    Vincent Donnefort authored and Ingo Molnar committed Mar 6, 2021
  8. cpu/hotplug: Allowing to reset fail injection

    Currently, the only way of resetting the fail injection is to trigger a
    hotplug, hotunplug or both. This is rather annoying for testing
    and, as the default value for this file is -1, it seems pretty natural to
    let a user write it.
    
    Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lkml.kernel.org/r/20210216103506.416286-2-vincent.donnefort@arm.com
    Vincent Donnefort authored and Ingo Molnar committed Mar 6, 2021
  9. sched/pelt: Fix task util_est update filtering

    Being called for each dequeue, util_est reduces the number of its updates
    by skipping them when the EWMA signal differs from the task util_avg
    by less than 1%. This is a problem for a sudden util_avg ramp-up: due to
    the decay from a previous high util_avg, EWMA might already be close
    enough to the new util_avg. No update would then happen, leaving
    ue.enqueued with an out-of-date value.
    
    Taking both util_est members, EWMA and enqueued, into consideration for
    the filtering ensures an up-to-date value for both.
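    
    The filtering rule can be sketched as follows. This is a simplified
    illustration, not the kernel code: CAPACITY_SCALE, within_margin() and
    util_est_needs_update() are names assumed for this example, and the 1%
    margin is taken relative to a 1024 scale.

```c
#include <stdbool.h>
#include <stdlib.h>

#define CAPACITY_SCALE   1024
#define UTIL_EST_MARGIN  (CAPACITY_SCALE / 100)  /* roughly 1% of the scale */

/* True when the two values differ by less than the margin. */
static bool within_margin(unsigned long a, unsigned long b)
{
    return labs((long)a - (long)b) < UTIL_EST_MARGIN;
}

/*
 * Decide whether the stored estimate must be refreshed at dequeue.
 * The update is skipped only when BOTH stored members are already close
 * to the new util_avg; checking ewma alone could leave enqueued stale.
 */
static bool util_est_needs_update(unsigned long ewma, unsigned long enqueued,
                                  unsigned long util_avg)
{
    return !(within_margin(ewma, util_avg) &&
             within_margin(enqueued, util_avg));
}
```

    For instance, with ewma = 500, a stale enqueued = 100 and a new util_avg of
    505, an EWMA-only filter would skip the update, whereas the combined check
    requests one.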
    
    This is for now an issue only for the trace probe that might return the
    stale value. Functional-wise, it isn't a problem, as the value is always
    accessed through max(enqueued, ewma).
    
    This problem has been observed using LISA's UtilConvergence:test_means on
    the sd845c board.
    
    No regression observed with Hackbench on sd845c and Perf-bench sched pipe
    on hikey/hikey960.
    
    Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lkml.kernel.org/r/20210225165820.1377125-1-vincent.donnefort@arm.com
    Vincent Donnefort authored and Ingo Molnar committed Mar 6, 2021
  10. sched/fair: Fix shift-out-of-bounds in load_balance()

    Syzbot reported a handful of occurrences where an sd->nr_balance_failed can
    grow to much higher values than one would expect.
    
    A successful load_balance() resets it to 0; a failed one increments
    it. Once it gets to sd->cache_nice_tries + 3, this *should* trigger an
    active balance, which will either set it to sd->cache_nice_tries+1 or reset
    it to 0. However, in case the to-be-active-balanced task is not allowed to
    run on env->dst_cpu, then the increment is done without any further
    modification.
    
    This could then be repeated ad nauseam, and would explain the absurdly high
    values reported by syzbot (86, 149). VincentG noted there is value in
    letting sd->nr_balance_failed grow, so the shift itself should be
    fixed. That means preventing:
    
      """
      If the value of the right operand is negative or is greater than or equal
      to the width of the promoted left operand, the behavior is undefined.
      """
    
    Thus we need to cap the shift exponent to
      BITS_PER_TYPE(typeof(lefthand)) - 1.
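    
    A capped shift of this kind can be sketched in portable C. The helper below
    is an illustration only: shr_capped() and ULONG_BITS are names assumed for
    this example, the operand is fixed to unsigned long, and the type is assumed
    to have no padding bits.

```c
#include <limits.h>

/* Width of unsigned long in bits (assuming no padding bits). */
#define ULONG_BITS (sizeof(unsigned long) * CHAR_BIT)

/*
 * Right shift whose exponent is clamped to the operand width minus one,
 * so the shift never reaches the undefined-behaviour range.
 */
static inline unsigned long shr_capped(unsigned long val, unsigned int shift)
{
    unsigned int max_shift = ULONG_BITS - 1;

    return val >> (shift < max_shift ? shift : max_shift);
}
```

    With the exponents reported by syzbot (86, 149), the clamped shift yields a
    well-defined small result instead of undefined behaviour.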
    
    I had a look around for other similar cases via coccinelle:
    
      @expr@
      position pos;
      expression E1;
      expression E2;
      @@
      (
      E1 >> E2@pos
      |
      E1 << E2@pos
      )
    
      @cst depends on expr@
      position pos;
      expression expr.E1;
      constant cst;
      @@
      (
      E1 >> cst@pos
      |
      E1 << cst@pos
      )
    
      @script:python depends on !cst@
      pos << expr.pos;
      exp << expr.E2;
      @@
      # Dirty hack to ignore constexpr
      if exp.upper() != exp:
         coccilib.report.print_report(pos[0], "Possible UB shift here")
    
    The only other match in kernel/sched is rq_clock_thermal() which employs
    sched_thermal_decay_shift, and that exponent is already capped to 10, so
    that one is fine.
    
    Fixes: 5a7f555 ("sched/fair: Relax constraint on task's load during load balance")
    Reported-by: syzbot+d7581744d5fd27c9fbe1@syzkaller.appspotmail.com
    Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: http://lore.kernel.org/r/000000000000ffac1205b9a2112f@google.com
    valschneider authored and Ingo Molnar committed Mar 6, 2021
  11. sched/fair: use lsub_positive in cpu_util_next()

    lsub_positive(), the local-variable version of sub_positive(), saves an
    explicit load-store and is sufficient for the cpu_util_next() usage.
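    
    The idea of a subtraction on a local variable that clamps at zero can be
    sketched as a plain function. This is an illustration, not the kernel
    macro: lsub_positive_ul() is a hypothetical name and the type is fixed to
    unsigned long.

```c
/*
 * Subtract val from *ptr without letting the unsigned result wrap
 * around: the result is clamped at zero. The target is assumed to be
 * a local variable, so no special load/store ordering is needed.
 */
static inline void lsub_positive_ul(unsigned long *ptr, unsigned long val)
{
    unsigned long cur = *ptr;

    *ptr = cur - (val < cur ? val : cur);
}
```

    Subtracting 120 from a local value of 300 leaves 180; subtracting 500 from
    that leaves 0 rather than a wrapped-around large number.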
    
    Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Quentin Perret <qperret@google.com>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Link: https://lkml.kernel.org/r/20210225083612.1113823-3-vincent.donnefort@arm.com
    Vincent Donnefort authored and Ingo Molnar committed Mar 6, 2021
  12. sched/fair: Fix task utilization accountability in compute_energy()

    find_energy_efficient_cpu() (feec()) computes for each perf_domain (pd) an
    energy delta as follows:
    
      feec(task)
        for_each_pd
          base_energy = compute_energy(task, -1, pd)
            -> for_each_cpu(pd)
               -> cpu_util_next(cpu, task, -1)
    
          energy_delta = compute_energy(task, dst_cpu, pd)
            -> for_each_cpu(pd)
               -> cpu_util_next(cpu, task, dst_cpu)
          energy_delta -= base_energy
    
    Then it picks the best CPU as being the one that minimizes energy_delta.
    
    cpu_util_next() estimates the CPU utilization that would happen if the
    task was placed on dst_cpu as follows:
    
      max(cpu_util + task_util, cpu_util_est + _task_util_est)
    
    The task contribution to the energy delta can then be either:
    
      (1) _task_util_est, on a mostly idle CPU, where cpu_util is close to 0
          and _task_util_est > cpu_util.
      (2) task_util, on a mostly busy CPU, where cpu_util > _task_util_est.
    
      (cpu_util_est doesn't appear here. It is 0 when a CPU is idle and
       otherwise must be small enough so that feec() takes the CPU as a
       potential target for the task placement)
    
    This is problematic for feec(), as cpu_util_next() might give an unfair
    advantage to a CPU which is mostly busy (2) compared to one which is
    mostly idle (1). _task_util_est being always bigger than task_util in
    feec() (as the task is waking up), the task contribution to the energy
    might look smaller on certain CPUs (2) and this breaks the energy
    comparison.
    
    This issue is, moreover, not sporadic. By starving idle CPUs, it keeps
    their cpu_util < _task_util_est (1) while others will maintain cpu_util >
    _task_util_est (2).
    
    Fix this problem by always using max(task_util, _task_util_est) as a task
    contribution to the energy (ENERGY_UTIL). The new estimated CPU
    utilization for the energy would then be:
    
      max(cpu_util, cpu_util_est) + max(task_util, _task_util_est)
    
    compute_energy() still needs to know which OPP would be selected if the
    task were migrated into the perf_domain (FREQUENCY_UTIL). Hence,
    cpu_util_next() is still used to estimate the maximum util within the pd.
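    
    The effect of the two formulas can be checked with a small worked example.
    The functions below are a sketch using names assumed for this illustration
    (util_before_fix(), util_after_fix(), task_contribution()); they only encode
    the two expressions quoted above and assume the task was not previously
    accounted on the CPU.

```c
static unsigned long max_ul(unsigned long a, unsigned long b)
{
    return a > b ? a : b;
}

/* max(cpu_util + task_util, cpu_util_est + _task_util_est) */
static unsigned long util_before_fix(unsigned long cpu_util,
                                     unsigned long cpu_util_est,
                                     unsigned long task_util,
                                     unsigned long task_util_est)
{
    return max_ul(cpu_util + task_util, cpu_util_est + task_util_est);
}

/* max(cpu_util, cpu_util_est) + max(task_util, _task_util_est) */
static unsigned long util_after_fix(unsigned long cpu_util,
                                    unsigned long cpu_util_est,
                                    unsigned long task_util,
                                    unsigned long task_util_est)
{
    return max_ul(cpu_util, cpu_util_est) + max_ul(task_util, task_util_est);
}

/* Utilization attributed to the task: value with the task minus the base. */
static unsigned long task_contribution(unsigned long util_with_task,
                                       unsigned long cpu_util,
                                       unsigned long cpu_util_est)
{
    return util_with_task - max_ul(cpu_util, cpu_util_est);
}
```

    Take a task with task_util = 100 and _task_util_est = 300. On an idle CPU
    (cpu_util = 0, cpu_util_est = 0) the old formula attributes 300 to the task,
    but on a busy CPU (cpu_util = 400, cpu_util_est = 50) it attributes only
    100. With the new formula the task contributes 300 in both cases.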
    
    Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Quentin Perret <qperret@google.com>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Link: https://lkml.kernel.org/r/20210225083612.1113823-2-vincent.donnefort@arm.com
    Vincent Donnefort authored and Ingo Molnar committed Mar 6, 2021
  13. sched/fair: Reduce the window for duplicated update

    Start by updating last_blocked_load_update_tick to reduce the possibility
    of another CPU starting the update one more time.
    
    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
    Link: https://lkml.kernel.org/r/20210224133007.28644-8-vincent.guittot@linaro.org
    vingu-linaro authored and Ingo Molnar committed Mar 6, 2021
  14. sched/fair: Trigger the update of blocked load on newly idle cpu

    Instead of waking up a random and already idle CPU, we can take advantage
    of this_cpu being about to enter idle to run the ILB and update the
    blocked load.
    
    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
    Link: https://lkml.kernel.org/r/20210224133007.28644-7-vincent.guittot@linaro.org
    vingu-linaro authored and Ingo Molnar committed Mar 6, 2021
  15. sched/fair: Reorder newidle_balance pulled_task tests

    Reorder the tests and skip useless ones when no load balance has been
    performed and rq lock has not been released.
    
    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
    Link: https://lkml.kernel.org/r/20210224133007.28644-6-vincent.guittot@linaro.org
    vingu-linaro authored and Ingo Molnar committed Mar 6, 2021
  16. sched/fair: Merge for each idle cpu loop of ILB

    Remove the specific case for handling this_cpu outside the for_each_cpu()
    loop when running ILB. Instead, use for_each_cpu_wrap() and start with the
    next CPU after this_cpu, so that the loop still finishes with this_cpu.
    
    update_nohz_stats() is now used for this_cpu too and will prevent
    unnecessary updates. We don't need a special case for handling the update of
    nohz.next_balance for this_cpu anymore because it is now handled by the
    loop like the others.
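    
    The resulting visiting order can be sketched in user space. The function
    below is an illustration, not the kernel iterator: ilb_visit_order() is a
    hypothetical name, and the idle CPU mask is modelled as a 64-bit integer.
    It walks the mask starting at the CPU after this_cpu and wraps around, so
    this_cpu is visited last when it is set in the mask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Fill out[] with the CPUs set in idle_mask, starting from the CPU after
 * this_cpu and wrapping around so that this_cpu comes last.
 * Returns the number of CPUs written; out[] must hold nr_cpus entries.
 */
static size_t ilb_visit_order(uint64_t idle_mask, unsigned int nr_cpus,
                              unsigned int this_cpu, unsigned int *out)
{
    size_t n = 0;

    assert(nr_cpus > 0 && nr_cpus <= 64 && this_cpu < nr_cpus);

    for (unsigned int i = 1; i <= nr_cpus; i++) {
        unsigned int cpu = (this_cpu + i) % nr_cpus;

        if (idle_mask & (UINT64_C(1) << cpu))
            out[n++] = cpu;
    }
    return n;
}
```

    With 8 CPUs, idle CPUs {0, 2, 4, 5, 7} and this_cpu = 4, the order is
    5, 7, 0, 2, 4, so no separate pass for this_cpu is required.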
    
    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
    Link: https://lkml.kernel.org/r/20210224133007.28644-5-vincent.guittot@linaro.org
    vingu-linaro authored and Ingo Molnar committed Mar 6, 2021
  17. sched/fair: Remove unused parameter of update_nohz_stats

    Idle load balance is the only user of update_nohz_stats() and doesn't use
    the force parameter. Remove it.
    
    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
    Link: https://lkml.kernel.org/r/20210224133007.28644-4-vincent.guittot@linaro.org
    vingu-linaro authored and Ingo Molnar committed Mar 6, 2021
  18. sched/fair: Remove unused return of _nohz_idle_balance

    The return value of _nohz_idle_balance() is not used anymore, so we can
    remove it.
    
    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
    Link: https://lkml.kernel.org/r/20210224133007.28644-3-vincent.guittot@linaro.org
    vingu-linaro authored and Ingo Molnar committed Mar 6, 2021