Li-RongQing/KV…

Commits on Jul 22, 2021

  1. KVM: Consider SMT idle status when halt polling

    SMT siblings share caches and other hardware, so halt polling
    degrades the sibling's performance when that sibling is busy.
    
    Signed-off-by: Li RongQing <lirongqing@baidu.com>
    lrq-max authored and intel-lab-lkp committed Jul 22, 2021

Commits on Jun 28, 2021

  1. sched: Optimize housekeeping_cpumask() in for_each_cpu_and()

    On a 128-core AMD machine, there are 8 cores in nohz_full mode, and
    the others are used for housekeeping. When many housekeeping CPUs are
    idle, ftrace shows a large amount of time spent in the loop that
    searches for the nearest busy housekeeping CPU.
    
       9)               |              get_nohz_timer_target() {
       9)               |                housekeeping_test_cpu() {
       9)   0.390 us    |                  housekeeping_get_mask.part.1();
       9)   0.561 us    |                }
       9)   0.090 us    |                __rcu_read_lock();
       9)   0.090 us    |                housekeeping_cpumask();
       9)   0.521 us    |                housekeeping_cpumask();
       9)   0.140 us    |                housekeeping_cpumask();
    
       ...
    
       9)   0.500 us    |                housekeeping_cpumask();
       9)               |                housekeeping_any_cpu() {
       9)   0.090 us    |                  housekeeping_get_mask.part.1();
       9)   0.100 us    |                  sched_numa_find_closest();
       9)   0.491 us    |                }
       9)   0.100 us    |                __rcu_read_unlock();
       9) + 76.163 us   |              }
    
    for_each_cpu_and() is a macro, so in get_nohz_timer_target() the
            for_each_cpu_and(i, sched_domain_span(sd),
                    housekeeping_cpumask(HK_FLAG_TIMER))
    is equivalent to:
            for (i = -1; i = cpumask_next_and(i, sched_domain_span(sd),
                    housekeeping_cpumask(HK_FLAG_TIMER)), i < nr_cpu_ids;)
    As a result, housekeeping_cpumask() is invoked on every iteration.
    The housekeeping_cpumask() function returns a constant value, so it is
    unnecessary to invoke it every time. This patch reduces the worst-case
    search time from ~76us to ~16us in my testing.
    
    Similarly, the find_new_ilb() function has the same problem.
    
    Co-developed-by: Li RongQing <lirongqing@baidu.com>
    Signed-off-by: Li RongQing <lirongqing@baidu.com>
    Signed-off-by: Yuan ZhaoXiong <yuanzhaoxiong@baidu.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/1622985115-51007-1-git-send-email-yuanzhaoxiong@baidu.com
    Yuan ZhaoXiong authored and Peter Zijlstra committed Jun 28, 2021
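
The re-evaluation described in the commit above can be reproduced in user space. The following is a minimal sketch, not the kernel code: `get_mask()`, `next_and()`, `for_each_and()` and the 8-bit mask layout are hypothetical stand-ins for `housekeeping_cpumask()`, `cpumask_next_and()` and `for_each_cpu_and()`. It counts how often the mask getter runs when it is passed directly as a macro argument versus when its result is hoisted into a local variable first.

```c
#define NR_CPUS 8

static int mask_calls;                     /* number of get_mask() calls */
static const unsigned int hk_mask = 0xf0;  /* assumed housekeeping CPUs 4-7 */

/* Stand-in for housekeeping_cpumask(): same value on every call. */
static unsigned int get_mask(void)
{
	mask_calls++;
	return hk_mask;
}

/* Next bit after n that is set in both a and b, or NR_CPUS if none. */
static int next_and(int n, unsigned int a, unsigned int b)
{
	for (n++; n < NR_CPUS; n++)
		if ((a & b) & (1u << n))
			return n;
	return NR_CPUS;
}

/* Macro arguments are expanded textually, so (b) is evaluated each pass. */
#define for_each_and(i, a, b) \
	for ((i) = -1; (i) = next_and((i), (a), (b)), (i) < NR_CPUS;)

/* Getter passed straight into the macro: called once per loop test. */
static int calls_naive(unsigned int span)
{
	int i, visited = 0;

	mask_calls = 0;
	for_each_and(i, span, get_mask())
		visited++;
	(void)visited;
	return mask_calls;
}

/* Getter result hoisted into a local: called exactly once. */
static int calls_hoisted(unsigned int span)
{
	unsigned int hk;
	int i, visited = 0;

	mask_calls = 0;
	hk = get_mask();
	for_each_and(i, span, hk)
		visited++;
	(void)visited;
	return mask_calls;
}
```

With a span of `0xff`, the naive loop calls the getter once per loop test (four matches plus the terminating test), while the hoisted loop calls it once.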
  2. sched/sysctl: Move extern sysctl declarations to sched.h

    Since commit '8a99b6833c88(sched: Move SCHED_DEBUG sysctl to debugfs)',
    SCHED_DEBUG sysctls are moved to debugfs, so these extern sysctls in
    include/linux/sched/sysctl.h are no longer needed for sysctl.c, and
    some are no longer needed at all.
    
    So move those extern sysctls that are needed by kernel/sched/debug.c to
    kernel/sched/sched.h, and remove the others that are no longer needed.
    
    Signed-off-by: Hailong Liu <liu.hailong6@zte.com.cn>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20210606115451.26745-1-liuhailongg6@163.com
    Hailong Liu authored and Peter Zijlstra committed Jun 28, 2021
  3. wait: use LIST_HEAD_INIT() to initialize wait_queue_head

    Replace the open-coded initialization with the right macro.
    
    Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20210601151120.329223-1-jwi@linux.ibm.com
    julianwiedmann authored and Peter Zijlstra committed Jun 28, 2021
  4. sched/debug: Don't update sched_domain debug directories before sched…

    …_debug_init()
    
    Since CPU capacity asymmetry can stem purely from maximum frequency
    differences (e.g. Pixel 1), a rebuild of the scheduler topology can be
    issued upon loading cpufreq, see:
    
      arch_topology.c::init_cpu_capacity_callback()
    
    Turns out that if this rebuild happens *before* sched_debug_init() is
    run (which is a late initcall), we end up messing up the sched_domain debug
    directory: passing a NULL parent to debugfs_create_dir() ends up creating
    the directory at the debugfs root, which in this case creates
    /sys/kernel/debug/domains (instead of /sys/kernel/debug/sched/domains).
    
    This currently doesn't happen on asymmetric systems which use cpufreq-scpi
    or cpufreq-dt drivers, as those are loaded via
    deferred_probe_initcall() (it is also a late initcall, but appears to be
    ordered *after* sched_debug_init()).
    
    Ionela has been working on detecting maximum frequency asymmetry via ACPI,
    and that actually happens via a *device* initcall, thus before
    sched_debug_init(), and causes the aforementioned debugfs mayhem.
    
    One option would be to punt sched_debug_init() down to
    fs_initcall_sync(). Preventing update_sched_domain_debugfs() from running
    before sched_debug_init() appears to be the safer option.
    
    Fixes: 3b87f13 ("sched,debug: Convert sysctl sched_domains to debugfs")
    Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: http://lore.kernel.org/r/20210514095339.12979-1-ionela.voinescu@arm.com
    valschneider authored and Peter Zijlstra committed Jun 28, 2021
  5. sched/fair: Ensure _sum and _avg values stay consistent

    The _sum and _avg values are generally kept in sync through the PELT
    divider. They are however not always in perfect sync,
    resulting in situations where _sum gets to zero while _avg stays
    positive. Such situations are undesirable.
    
    This comes from the fact that PELT will increase period_contrib, also
    increasing the PELT divider, without updating _sum and _avg values to
    stay in perfect sync where (_sum == _avg * divider). However, such PELT
    change will never lower _sum, making it impossible to end up in a
    situation where _sum is zero and _avg is not.
    
    Therefore, when subtracting load outside PELT, we need to ensure
    that when _sum is zero, _avg is also set to zero. The inconsistency
    occurs when (_sum < _avg * divider), and the subtracted (_avg * divider)
    is bigger than or equal to the current _sum, while the subtracted _avg
    is smaller than the current _avg.
    
    Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
    Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
    Signed-off-by: Odin Ugedal <odin@uged.al>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
    Link: https://lore.kernel.org/r/20210624111815.57937-1-odin@uged.al
    odinuge authored and Peter Zijlstra committed Jun 28, 2021
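
The failure condition in the commit above can be checked with small numbers. This is a sketch under stated assumptions, not the patch itself: `struct avg_pair`, `sub_floor0()` and both `remove_*()` helpers are hypothetical names, and recomputing `_sum` from `_avg * divider` is shown only as one way to preserve the invariant that a zero `_sum` implies a zero `_avg`.

```c
#include <stdint.h>

/* Hypothetical container for one _sum/_avg pair. */
struct avg_pair {
	uint64_t sum;
	uint64_t avg;
};

/* Subtract b from a, clamping the result at zero. */
static uint64_t sub_floor0(uint64_t a, uint64_t b)
{
	return a > b ? a - b : 0;
}

/* Subtract _avg and _sum independently: the two can drift apart. */
static void remove_independent(struct avg_pair *p, uint64_t se_avg,
			       uint64_t divider)
{
	p->avg = sub_floor0(p->avg, se_avg);
	p->sum = sub_floor0(p->sum, se_avg * divider);
}

/* Subtract _avg, then derive _sum from it so both reach zero together. */
static void remove_consistent(struct avg_pair *p, uint64_t se_avg,
			      uint64_t divider)
{
	p->avg = sub_floor0(p->avg, se_avg);
	p->sum = p->avg * divider;
}
```

Starting from `_sum = 90`, `_avg = 3` and a divider of 47 (so `_sum < _avg * divider`), removing an `_avg` of 2 subtracts 94 from `_sum`. The independent variant ends with `_sum = 0` and `_avg = 1`; the consistent variant ends with `_sum = 47` and `_avg = 1`.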

Commits on Jun 24, 2021

  1. sched/doc: Update the CPU capacity asymmetry bits

    Update the documentation bits referring to capacity aware scheduling
    with regards to newly introduced SD_ASYM_CPUCAPACITY_FULL sched_domain
    flag.
    
    Signed-off-by: Beata Michalska <beata.michalska@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Link: https://lore.kernel.org/r/20210603140627.8409-4-beata.michalska@arm.com
    bea-arm authored and Peter Zijlstra committed Jun 24, 2021
  2. sched/topology: Rework CPU capacity asymmetry detection

    Currently the CPU capacity asymmetry detection, performed through
    asym_cpu_capacity_level, tries to identify the lowest topology level
    at which the highest CPU capacity is being observed, not necessarily
    finding the level at which all possible capacity values are visible
    to all CPUs, which might be a bit problematic for some possible/valid
    asymmetric topologies, e.g.:
    
    DIE      [                                ]
    MC       [                       ][       ]
    
    CPU       [0] [1] [2] [3] [4] [5]  [6] [7]
    Capacity  |.....| |.....| |.....|  |.....|
                 L       M       B        B
    
    Where:
     arch_scale_cpu_capacity(L) = 512
     arch_scale_cpu_capacity(M) = 871
     arch_scale_cpu_capacity(B) = 1024
    
    In this particular case, the asymmetric topology level will point
    at MC, as all possible CPU masks for that level do cover the CPU
    with the highest capacity. It will work just fine for the first
    cluster, not so much for the second one though (consider the
    find_energy_efficient_cpu which might end up attempting the energy
    aware wake-up for a domain that does not see any asymmetry at all).
    
    Rework the way the capacity asymmetry levels are being detected,
    so that each CPU points to the lowest topology level where the
    full set of available CPU capacities is visible to all CPUs within the
    given domain. As a result, the per-cpu sd_asym_cpucapacity might differ
    across the domains. This will have an impact on EAS wake-up placement in
    that it might see a different range of CPUs to be considered, depending
    on the given current and target CPUs.
    
    Additionally, those levels, where any range of asymmetry (not
    necessarily full) is being detected will get identified as well.
    The selected asymmetric topology level will be denoted by
    SD_ASYM_CPUCAPACITY_FULL sched domain flag whereas the 'sub-levels'
    would receive the already used SD_ASYM_CPUCAPACITY flag. This allows
    maintaining the current behaviour for asymmetric topologies, with
    misfit migration operating correctly on lower levels, if applicable,
    as any asymmetry is enough to trigger the misfit migration.
    The logic there relies on the SD_ASYM_CPUCAPACITY flag and does not
    relate to the full asymmetry level denoted by the sd_asym_cpucapacity
    pointer.
    
    Detecting the CPU capacity asymmetry is being based on a set of
    available CPU capacities for all possible CPUs. This data is being
    generated upon init and updated once CPU topology changes are being
    detected (through arch_update_cpu_topology). As such, any changes
    to identified CPU capacities (like initializing cpufreq) need to be
    explicitly advertised by corresponding archs to trigger rebuilding
    the data.
    
    Additional -dflags- parameter, used when building sched domains, has
    been removed as well, as the asymmetry flags are now being set directly
    in sd_init.
    
    Suggested-by: Peter Zijlstra <peterz@infradead.org>
    Suggested-by: Valentin Schneider <valentin.schneider@arm.com>
    Signed-off-by: Beata Michalska <beata.michalska@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Tested-by: Valentin Schneider <valentin.schneider@arm.com>
    Link: https://lore.kernel.org/r/20210603140627.8409-3-beata.michalska@arm.com
    bea-arm authored and Peter Zijlstra committed Jun 24, 2021
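
The classification described in the commit above can be sketched for its example topology. This is an illustrative user-space model, not the kernel implementation: `ASYM_ANY` and `ASYM_FULL` are hypothetical stand-ins for SD_ASYM_CPUCAPACITY and SD_ASYM_CPUCAPACITY_FULL, domains are modelled as 8-bit CPU masks, and `distinct_caps()` and `classify()` are made-up helper names.

```c
#define NCPU      8
#define ASYM_ANY  0x1  /* stand-in for SD_ASYM_CPUCAPACITY */
#define ASYM_FULL 0x2  /* stand-in for SD_ASYM_CPUCAPACITY_FULL */

/* Capacities from the example: CPUs 0-1 = L, 2-3 = M, 4-7 = B. */
static const unsigned int cpu_capacity[NCPU] = {
	512, 512, 871, 871, 1024, 1024, 1024, 1024
};

/* Count the distinct capacity values among the CPUs set in span. */
static int distinct_caps(unsigned int span)
{
	unsigned int seen[NCPU];
	int n = 0;

	for (int cpu = 0; cpu < NCPU; cpu++) {
		int dup = 0;

		if (!(span & (1u << cpu)))
			continue;
		for (int k = 0; k < n; k++)
			if (seen[k] == cpu_capacity[cpu])
				dup = 1;
		if (!dup)
			seen[n++] = cpu_capacity[cpu];
	}
	return n;
}

/* No flag: one capacity; ANY: some but not all; ANY|FULL: all of them. */
static int classify(unsigned int span)
{
	int total = distinct_caps((1u << NCPU) - 1);
	int local = distinct_caps(span);

	if (local <= 1)
		return 0;
	if (local < total)
		return ASYM_ANY;
	return ASYM_ANY | ASYM_FULL;
}
```

In this model the MC span covering CPUs 0-5 (`0x3f`) sees all three capacities and is fully asymmetric, while the MC span covering CPUs 6-7 (`0xc0`) sees none. CPUs 6 and 7 only find full asymmetry at the DIE span (`0xff`), which matches the per-CPU difference the commit describes.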
  3. sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag

    Introduce a new sched_domain topology flag, complementary to
    SD_ASYM_CPUCAPACITY, to distinguish between sched_domains where any CPU
    capacity asymmetry is detected (SD_ASYM_CPUCAPACITY) and ones where
    a full set of CPU capacities is visible to all domain members
    (SD_ASYM_CPUCAPACITY_FULL).
    
    With the distinction between full and partial CPU capacity asymmetry,
    brought in by the newly introduced flag, the scope of the original
    SD_ASYM_CPUCAPACITY flag gets shifted, still maintaining the existing
    behaviour when one is detected on a given sched domain, allowing
    misfit migrations within sched domains that do not observe full range
    of CPU capacities but still do have members with different capacity
    values. It loses its meaning, though, when it comes to the lowest CPU
    asymmetry sched_domain level per-cpu pointer, which is to be now
    denoted by SD_ASYM_CPUCAPACITY_FULL flag.
    
    Signed-off-by: Beata Michalska <beata.michalska@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Link: https://lore.kernel.org/r/20210603140627.8409-2-beata.michalska@arm.com
    bea-arm authored and Peter Zijlstra committed Jun 24, 2021
  4. psi: Fix race between psi_trigger_create/destroy

    A race was detected between psi_trigger_destroy/create as shown below,
    which causes a panic by accessing an invalid
    psi_system->poll_wait->wait_queue_entry and
    psi_system->poll_timer->entry->next. With this modification, the
    race window is removed by initialising poll_wait and poll_timer in
    group_init, which is executed only once at the beginning.
    
      psi_trigger_destroy()                   psi_trigger_create()
    
      mutex_lock(trigger_lock);
      rcu_assign_pointer(poll_task, NULL);
      mutex_unlock(trigger_lock);
    					  mutex_lock(trigger_lock);
    					  if (!rcu_access_pointer(group->poll_task)) {
    					    timer_setup(poll_timer, poll_timer_fn, 0);
    					    rcu_assign_pointer(poll_task, task);
    					  }
    					  mutex_unlock(trigger_lock);
    
      synchronize_rcu();
      del_timer_sync(poll_timer); <-- poll_timer has been reinitialized by
                                      psi_trigger_create()
    
    So, trigger_lock/RCU correctly protects destruction of
    group->poll_task but misses this race affecting poll_timer and
    poll_wait.
    
    Fixes: 461daba ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
    Co-developed-by: ziwei.dai <ziwei.dai@unisoc.com>
    Signed-off-by: ziwei.dai <ziwei.dai@unisoc.com>
    Co-developed-by: ke.wang <ke.wang@unisoc.com>
    Signed-off-by: ke.wang <ke.wang@unisoc.com>
    Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Suren Baghdasaryan <surenb@google.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Link: https://lkml.kernel.org/r/1623371374-15664-1-git-send-email-huangzhaoyang@gmail.com
    Zhaoyang Huang authored and Peter Zijlstra committed Jun 24, 2021
  5. sched/fair: Introduce the burstable CFS controller

    The CFS bandwidth controller limits CPU requests of a task group to
    quota during each period. However, parallel workloads might be bursty
    so that they get throttled even when their average utilization is under
    quota. And they are latency sensitive at the same time so that
    throttling them is undesired.
    
    We borrow time now against our future underrun, at the cost of increased
    interference against the other system users. All nicely bounded.
    
    Traditional (UP-EDF) bandwidth control is something like:
    
      (U = \Sum u_i) <= 1
    
    This guarantees both that every deadline is met and that the system is
    stable. After all, if U were > 1, then for every second of walltime,
    we'd have to run more than a second of program time, and obviously miss
    our deadline, but the next deadline will be further out still, there is
    never time to catch up, unbounded fail.
    
    This work observes that a workload doesn't always execute the full
    quota; this enables one to describe u_i as a statistical distribution.
    
    For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
    (the traditional WCET). This effectively allows u to be smaller,
    increasing the efficiency (we can pack more tasks in the system), but at
    the cost of missing deadlines when all the odds line up. However, it
    does maintain stability, since every overrun must be paired with an
    underrun as long as our x is above the average.
    
    That is, suppose we have 2 tasks, both specify a p(95) value, then we
    have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
    everything is good. At the same time we have a p(5)*p(5) = 0.25% chance
    both tasks will exceed their quota at the same time (guaranteed deadline
    fail). Somewhere in between there's a threshold where one exceeds and
    the other doesn't underrun enough to compensate; this depends on the
    specific CDFs.
    
    At the same time, we can say that the worst case deadline miss, will be
    \Sum e_i; that is, there is a bounded tardiness (under the assumption
    that x+e is indeed WCET).
    
    The benefit of burst is seen when testing with schbench. Default value of
    kernel.sched_cfs_bandwidth_slice_us(5ms) and CONFIG_HZ(1000) is used.
    
    	mkdir /sys/fs/cgroup/cpu/test
    	echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
    	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
    	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
    
    	./schbench -m 1 -t 3 -r 20 -c 80000 -R 10
    
    The average CPU usage is at 80%. I ran this 10 times, and got long tail
    latency 6 times and got throttled 8 times.
    
    Tail latencies are shown below, and it wasn't the worst case.
    
    	Latency percentiles (usec)
    		50.0000th: 19872
    		75.0000th: 21344
    		90.0000th: 22176
    		95.0000th: 22496
    		*99.0000th: 22752
    		99.5000th: 22752
    		99.9000th: 22752
    		min=0, max=22727
    	rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%
    
    The interference when using burst is measured by the probability of
    missing the deadline and the average WCET. Test results showed that when
    there are many cgroups or the CPU is underutilized, the interference is
    limited. More details are shown in:
    https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/
    
    Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
    Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
    Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
    Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
    Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Ben Segall <bsegall@google.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/r/20210621092800.23714-2-changhuaixin@linux.alibaba.com
    changhuaixin authored and Peter Zijlstra committed Jun 24, 2021
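
The effect of borrowing against future underrun, as described in the commit above, can be sketched with a toy per-period simulation. Everything here is an assumption for illustration: `struct cfs_bw`, `refill()` and `count_throttled()` are hypothetical names, time is in arbitrary units, and the refill rule (add one quota per period, capped at quota plus burst) is one plausible reading of the commit message rather than a copy of the kernel code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bandwidth state for one task group. */
struct cfs_bw {
	uint64_t quota;    /* runtime granted per period */
	uint64_t burst;    /* extra runtime that may be accumulated */
	uint64_t runtime;  /* runtime currently available */
};

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/* Period start: add one quota, keeping at most quota + burst. */
static void refill(struct cfs_bw *b)
{
	b->runtime = min_u64(b->runtime + b->quota, b->quota + b->burst);
}

/* Count the periods in which demand exceeds the available runtime. */
static int count_throttled(uint64_t quota, uint64_t burst,
			   const uint64_t *demand, size_t n)
{
	struct cfs_bw b = { quota, burst, 0 };
	int throttled = 0;

	for (size_t i = 0; i < n; i++) {
		refill(&b);
		if (demand[i] > b.runtime) {
			throttled++;      /* ran out mid-period */
			b.runtime = 0;
		} else {
			b.runtime -= demand[i];
		}
	}
	return throttled;
}
```

For per-period demands of 60, 60, 140 and 60 against a quota of 100, the average utilization is 80% of quota. With no burst, the third period is throttled; with a burst of 100, the unused runtime from the first two periods covers the spike and no period is throttled.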

Commits on Jun 22, 2021

  1. sched/uclamp: Fix uclamp_tg_restrict()

    Now that cpu.uclamp.min acts as a protection, we need to make sure that the
    uclamp request of the task is within the allowed range of the cgroup,
    that is it is clamp()'ed correctly by tg->uclamp[UCLAMP_MIN] and
    tg->uclamp[UCLAMP_MAX].
    
    As reported by Xuewen [1] we can have some corner cases where there's
    inversion between uclamp requested by task (p) and the uclamp values of
    the taskgroup it's attached to (tg). The following table demonstrates
    2 corner cases:
    
    	           |  p  |  tg  |  effective
    	-----------+-----+------+-----------
    	CASE 1
    	-----------+-----+------+-----------
    	uclamp_min | 60% | 0%   |  60%
    	-----------+-----+------+-----------
    	uclamp_max | 80% | 50%  |  50%
    	-----------+-----+------+-----------
    	CASE 2
    	-----------+-----+------+-----------
    	uclamp_min | 0%  | 30%  |  30%
    	-----------+-----+------+-----------
    	uclamp_max | 20% | 50%  |  20%
    	-----------+-----+------+-----------
    
    With this fix we get:
    
    	           |  p  |  tg  |  effective
    	-----------+-----+------+-----------
    	CASE 1
    	-----------+-----+------+-----------
    	uclamp_min | 60% | 0%   |  50%
    	-----------+-----+------+-----------
    	uclamp_max | 80% | 50%  |  50%
    	-----------+-----+------+-----------
    	CASE 2
    	-----------+-----+------+-----------
    	uclamp_min | 0%  | 30%  |  30%
    	-----------+-----+------+-----------
    	uclamp_max | 20% | 50%  |  30%
    	-----------+-----+------+-----------
    
    Additionally uclamp_update_active_tasks() must now unconditionally
    update both UCLAMP_MIN/MAX because changing the tg's UCLAMP_MAX for
    instance could have an impact on the effective UCLAMP_MIN of the tasks.
    
    	           |  p  |  tg  |  effective
    	-----------+-----+------+-----------
    	old
    	-----------+-----+------+-----------
    	uclamp_min | 60% | 0%   |  50%
    	-----------+-----+------+-----------
    	uclamp_max | 80% | 50%  |  50%
    	-----------+-----+------+-----------
    	*new*
    	-----------+-----+------+-----------
    	uclamp_min | 60% | 0%   | *60%*
    	-----------+-----+------+-----------
    	uclamp_max | 80% |*70%* | *70%*
    	-----------+-----+------+-----------
    
    [1] https://lore.kernel.org/lkml/CAB8ipk_a6VFNjiEnHRHkUMBKbA+qzPQvhtNjJ_YNzQhqV_o8Zw@mail.gmail.com/
    
    Fixes: 0c18f2e ("sched/uclamp: Fix wrong implementation of cpu.uclamp.min")
    Reported-by: Xuewen Yan <xuewen.yan94@gmail.com>
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20210617165155.3774110-1-qais.yousef@arm.com
    qais-yousef authored and Peter Zijlstra committed Jun 22, 2021
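
The "with this fix" tables in the commit above follow from clamping each task request into the task group's [min, max] range. The sketch below reproduces those rows in user space; `struct uclamp_pair`, `clamp_uint()` and `effective()` are hypothetical names, and values are plain percentages rather than the kernel's internal scale.

```c
/* Hypothetical pair of clamp values, in percent. */
struct uclamp_pair {
	unsigned int min;
	unsigned int max;
};

/* Restrict v to the inclusive range [lo, hi]. */
static unsigned int clamp_uint(unsigned int v, unsigned int lo,
			       unsigned int hi)
{
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/* Both task requests are clamped by the same task group range. */
static struct uclamp_pair effective(struct uclamp_pair p,
				    struct uclamp_pair tg)
{
	struct uclamp_pair eff = {
		clamp_uint(p.min, tg.min, tg.max),
		clamp_uint(p.max, tg.min, tg.max),
	};

	return eff;
}
```

For CASE 1 (p = 60%/80%, tg = 0%/50%) this gives 50%/50%, and for CASE 2 (p = 0%/20%, tg = 30%/50%) it gives 30%/30%. Raising the tg maximum to 70% in CASE 1 changes the effective minimum from 50% to 60%, which is why both clamps have to be refreshed together.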
  2. sched/rt: Fix Deadline utilization tracking during policy change

    DL keeps track of the utilization on a per-rq basis with the structure
    avg_dl. This utilization is updated during task_tick_dl(),
    put_prev_task_dl() and set_next_task_dl(). However, when the current
    running task changes its policy, set_next_task_dl() which would usually
    take care of updating the utilization when the rq starts running DL
    tasks, will not see such a change, leaving the avg_dl structure outdated.
    When that very same task will be dequeued later, put_prev_task_dl() will
    then update the utilization, based on a wrong last_update_time, leading to
    a huge spike in the DL utilization signal.
    
    The signal would eventually recover from this issue after a few ms. Even
    if no DL tasks are run, avg_dl is also updated in
    __update_blocked_others(). But as the CPU capacity depends partly on the
    avg_dl, this issue has nonetheless a significant impact on the scheduler.
    
    Fix this issue by ensuring a load update when a running task changes
    its policy to DL.
    
    Fixes: 3727e0e ("sched/dl: Add dl_rq utilization tracking")
    Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/1624271872-211872-3-git-send-email-vincent.donnefort@arm.com
    Vincent Donnefort authored and Peter Zijlstra committed Jun 22, 2021
  3. sched/rt: Fix RT utilization tracking during policy change

    RT keeps track of the utilization on a per-rq basis with the structure
    avg_rt. This utilization is updated during task_tick_rt(),
    put_prev_task_rt() and set_next_task_rt(). However, when the current
    running task changes its policy, set_next_task_rt() which would usually
    take care of updating the utilization when the rq starts running RT tasks,
    will not see such a change, leaving the avg_rt structure outdated. When
    that very same task will be dequeued later, put_prev_task_rt() will then
    update the utilization, based on a wrong last_update_time, leading to a
    huge spike in the RT utilization signal.
    
    The signal would eventually recover from this issue after a few ms. Even if
    no RT tasks are run, avg_rt is also updated in __update_blocked_others().
    But as the CPU capacity depends partly on the avg_rt, this issue has
    nonetheless a significant impact on the scheduler.
    
    Fix this issue by ensuring a load update when a running task changes
    its policy to RT.
    
    Fixes: 371bf42 ("sched/rt: Add rt_rq utilization tracking")
    Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/1624271872-211872-2-git-send-email-vincent.donnefort@arm.com
    Vincent Donnefort authored and Peter Zijlstra committed Jun 22, 2021

Commits on Jun 18, 2021

  1. sched: Change task_struct::state

    Change the type and name of task_struct::state. Drop the volatile and
    shrink it to an 'unsigned int'. Rename it in order to find all uses
    such that we can use READ_ONCE/WRITE_ONCE as appropriate.
    
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
    Acked-by: Will Deacon <will@kernel.org>
    Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
    Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org
    Peter Zijlstra committed Jun 18, 2021
  2. sched,arch: Remove unused TASK_STATE offsets

    All 6 architectures define TASK_STATE in asm-offsets, but then never
    actually use it. Remove the definitions to make sure they never will.
    
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20210611082838.472811363@infradead.org
    Peter Zijlstra committed Jun 18, 2021
  3. sched,timer: Use __set_current_state()

    There's an existing helper for setting TASK_RUNNING; must've gotten
    lost last time we did this cleanup.
    
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
    Acked-by: Will Deacon <will@kernel.org>
    Link: https://lore.kernel.org/r/20210611082838.409696194@infradead.org
    Peter Zijlstra committed Jun 18, 2021
  4. sched: Add get_current_state()

    Remove yet another few p->state accesses.
    
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Will Deacon <will@kernel.org>
    Link: https://lore.kernel.org/r/20210611082838.347475156@infradead.org
    Peter Zijlstra committed Jun 18, 2021
  5. sched,perf,kvm: Fix preemption condition

    When run from the sched-out path (preempt_notifier or perf_event),
    p->state is irrelevant to determine preemption. You can get preempted
    with !task_is_running() just fine.
    
    The right indicator for preemption is if the task is still on the
    runqueue in the sched-out path.
    
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Mark Rutland <mark.rutland@arm.com>
    Link: https://lore.kernel.org/r/20210611082838.285099381@infradead.org
    Peter Zijlstra committed Jun 18, 2021
  6. sched: Introduce task_is_running()

    Replace a bunch of 'p->state == TASK_RUNNING' with a new helper:
    task_is_running(p).
    
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Davidlohr Bueso <dave@stgolabs.net>
    Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
    Acked-by: Will Deacon <will@kernel.org>
    Link: https://lore.kernel.org/r/20210611082838.222401495@infradead.org
    Peter Zijlstra committed Jun 18, 2021
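
A helper of the kind introduced in the commit above might look roughly like the following user-space sketch. It is an assumption-laden illustration, not the kernel definition: `struct task`, its `state_` field and the `READ_STATE()` macro are hypothetical, and the volatile read only approximates what a READ_ONCE-style access provides.

```c
#define TASK_RUNNING       0x0000u
#define TASK_INTERRUPTIBLE 0x0001u

/* Hypothetical, minimal stand-in for a task descriptor. */
struct task {
	unsigned int state_;
};

/* Force a single read of the state via a volatile access. */
#define READ_STATE(t) (*(const volatile unsigned int *)&(t)->state_)

/* Replaces open-coded 'p->state == TASK_RUNNING' comparisons. */
static inline int task_is_running(const struct task *t)
{
	return READ_STATE(t) == TASK_RUNNING;
}
```

Funnelling every comparison through one helper means the state field can later be renamed or retyped by touching only the helper.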
  7. sched: Unbreak wakeups

    Remove broken task->state references and let wake_up_process() DTRT.
    
    The anti-pattern in these patches breaks the ordering of ->state vs
    COND as described in the comment near set_current_state() and can lead
    to missed wakeups:
    
    	(OoO load, observes RUNNING)<-.
    	for (;;) {                    |
    	  t->state = UNINTERRUPTIBLE; |
    	  smp_mb();          ,----->  | (observes !COND)
                                 |        /
    	  if (COND) ---------'       |	COND = 1;
    		break;		     `- if (t->state != RUNNING)
    					  wake_up_process(t); // not done
    	  schedule(); // forever waiting
    	}
    	t->state = TASK_RUNNING;
    
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
    Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Acked-by: Will Deacon <will@kernel.org>
    Link: https://lore.kernel.org/r/20210611082838.160855222@infradead.org
    Peter Zijlstra committed Jun 18, 2021
  8. Merge branch 'sched/urgent' into sched/core, to resolve conflicts

    This commit in sched/urgent moved the cfs_rq_is_decayed() function:
    
      a7b359f: ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
    
    and this fresh commit in sched/core modified it in the old location:
    
      9e077b5: ("sched/pelt: Check that *_avg are null when *_sum are")
    
    Merge the two variants.
    
    Conflicts:
    	kernel/sched/fair.c
    
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Ingo Molnar committed Jun 18, 2021

Commits on Jun 17, 2021

  1. sched/fair: Age the average idle time

    This is a partial forward-port of Peter Zijlstra's work first posted
    at:
    
       https://lore.kernel.org/lkml/20180530142236.667774973@infradead.org/
    
    Currently select_idle_cpu()'s proportional scheme uses the average idle
    time *for when we are idle*, that is temporally challenged.  When a CPU
    is not at all idle, we'll happily continue using whatever value we did
    see when the CPU goes idle. To fix this, introduce a separate average
    idle and age it (the existing value still makes sense for things like
    new-idle balancing, which happens when we do go idle).
    
    The overall goal is to not spend more time scanning for idle CPUs than
    we're idle for. Otherwise we're inhibiting work. This means that we need to
    consider the cost over all the wake-ups between consecutive idle periods.
    To track this, the scan cost is subtracted from the estimated average
    idle time.
    
    The impact of this patch is related to workloads that have domains that
    are fully busy or overloaded. Without the patch, the scan depth may be
    too high because a CPU is not reaching idle.
    
    Due to the nature of the patch, this is a regression magnet. It
    potentially wins when domains are almost fully busy or overloaded --
    at that point searches are likely to fail but idle is not being aged
    as CPUs are active so search depth is too large and useless. It will
    potentially show regressions when there are idle CPUs and a deep search is
    beneficial. This tbench result on a 2-socket broadwell machine partially
    illustrates the problem:
    
                              5.13.0-rc2             5.13.0-rc2
                                 vanilla     sched-avgidle-v1r5
    Hmean     1        445.02 (   0.00%)      451.36 *   1.42%*
    Hmean     2        830.69 (   0.00%)      846.03 *   1.85%*
    Hmean     4       1350.80 (   0.00%)     1505.56 *  11.46%*
    Hmean     8       2888.88 (   0.00%)     2586.40 * -10.47%*
    Hmean     16      5248.18 (   0.00%)     5305.26 *   1.09%*
    Hmean     32      8914.03 (   0.00%)     9191.35 *   3.11%*
    Hmean     64     10663.10 (   0.00%)    10192.65 *  -4.41%*
    Hmean     128    18043.89 (   0.00%)    18478.92 *   2.41%*
    Hmean     256    16530.89 (   0.00%)    17637.16 *   6.69%*
    Hmean     320    16451.13 (   0.00%)    17270.97 *   4.98%*
    
    Note that 8 was a regression point where a deeper search would have helped
    but it gains for high thread counts when searches are useless. Hackbench
    is a more extreme example, although not perfect as the tasks idle rapidly:
    
    hackbench-process-pipes
                              5.13.0-rc2             5.13.0-rc2
                                 vanilla     sched-avgidle-v1r5
    Amean     1        0.3950 (   0.00%)      0.3887 (   1.60%)
    Amean     4        0.9450 (   0.00%)      0.9677 (  -2.40%)
    Amean     7        1.4737 (   0.00%)      1.4890 (  -1.04%)
    Amean     12       2.3507 (   0.00%)      2.3360 *   0.62%*
    Amean     21       4.0807 (   0.00%)      4.0993 *  -0.46%*
    Amean     30       5.6820 (   0.00%)      5.7510 *  -1.21%*
    Amean     48       8.7913 (   0.00%)      8.7383 (   0.60%)
    Amean     79      14.3880 (   0.00%)     13.9343 *   3.15%*
    Amean     110     21.2233 (   0.00%)     19.4263 *   8.47%*
    Amean     141     28.2930 (   0.00%)     25.1003 *  11.28%*
    Amean     172     34.7570 (   0.00%)     30.7527 *  11.52%*
    Amean     203     41.0083 (   0.00%)     36.4267 *  11.17%*
    Amean     234     47.7133 (   0.00%)     42.0623 *  11.84%*
    Amean     265     53.0353 (   0.00%)     47.7720 *   9.92%*
    Amean     296     60.0170 (   0.00%)     53.4273 *  10.98%*
    Stddev    1        0.0052 (   0.00%)      0.0025 (  51.57%)
    Stddev    4        0.0357 (   0.00%)      0.0370 (  -3.75%)
    Stddev    7        0.0190 (   0.00%)      0.0298 ( -56.64%)
    Stddev    12       0.0064 (   0.00%)      0.0095 ( -48.38%)
    Stddev    21       0.0065 (   0.00%)      0.0097 ( -49.28%)
    Stddev    30       0.0185 (   0.00%)      0.0295 ( -59.54%)
    Stddev    48       0.0559 (   0.00%)      0.0168 (  69.92%)
    Stddev    79       0.1559 (   0.00%)      0.0278 (  82.17%)
    Stddev    110      1.1728 (   0.00%)      0.0532 (  95.47%)
    Stddev    141      0.7867 (   0.00%)      0.0968 (  87.69%)
    Stddev    172      1.0255 (   0.00%)      0.0420 (  95.91%)
    Stddev    203      0.8106 (   0.00%)      0.1384 (  82.92%)
    Stddev    234      1.1949 (   0.00%)      0.1328 (  88.89%)
    Stddev    265      0.9231 (   0.00%)      0.0820 (  91.11%)
    Stddev    296      1.0456 (   0.00%)      0.1327 (  87.31%)
    
    Again, higher thread counts benefit and the standard deviation
    shows that results are also a lot more stable when the idle
    time is aged.
    
    The patch potentially matters when a socket has multiple LLCs as the
    maximum search depth is lower. However, some of the test results were
    suspiciously good (e.g. specjbb2005 gaining 50% on a Zen1 machine) and
    other results were not dramatically different to other machines.
    
    Given the nature of the patch, Peter's full series is not being forward
    ported as each part should stand on its own. Preferably they would be
    merged at different times to reduce the risk of false bisections.
    
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20210615111611.GH30378@techsingularity.net
    Peter Zijlstra committed Jun 17, 2021
  2. sched/cpufreq: Consider reduced CPU capacity in energy calculation

    Energy Aware Scheduling (EAS) needs to predict the decisions made by
    SchedUtil. The map_util_freq() exists to do that.
    
    There are corner cases where the max allowed frequency might be reduced
    (due to thermal). SchedUtil as a CPUFreq governor, is aware of that
    but EAS is not. This patch aims to address it.
    
    SchedUtil stores the maximum allowed frequency in
    'sugov_policy::next_freq' field. EAS has to predict that value, which is
    the real used frequency. That value is made after a call to
    cpufreq_driver_resolve_freq() which clamps to the CPUFreq policy limits.
    In the existing code EAS is not able to predict that real frequency.
    This leads to energy estimation errors.
    
    To avoid wrong energy estimation in EAS (due to frequency misprediction),
    make sure that the step which calculates the Performance Domain frequency
    is also aware of the allowed CPU capacity.
    
    Furthermore, modify map_util_freq() to not extend the frequency value.
    Instead, use map_util_perf() to extend the util value in both places:
    SchedUtil and EAS, but for EAS clamp it to max allowed CPU capacity.
    In the end, we achieve the same desirable behavior for both subsystems
    and alignment in regards to the real CPU frequency.
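
    The split between the two helpers can be sketched as below. This is a
    minimal user-space illustration, not the kernel code itself: it assumes
    the headroom is 25% (util + util/4), and eas_predict_freq() together
    with its 'allowed_cap' parameter is a hypothetical name introduced only
    for this example.

```c
/* Extend a utilization value with headroom (assumed here to be 25%). */
static unsigned long map_util_perf(unsigned long util)
{
	return util + (util >> 2);
}

/* Map utilization to a frequency without adding any headroom itself. */
static unsigned long map_util_freq(unsigned long util, unsigned long freq,
				   unsigned long cap)
{
	return freq * util / cap;
}

/*
 * Hypothetical EAS-side helper: extend the utilization the same way
 * SchedUtil would, clamp it to the allowed CPU capacity, then map it
 * to a frequency.
 */
static unsigned long eas_predict_freq(unsigned long util,
				      unsigned long max_freq,
				      unsigned long cap,
				      unsigned long allowed_cap)
{
	util = map_util_perf(util);
	if (util > allowed_cap)
		util = allowed_cap;

	return map_util_freq(util, max_freq, cap);
}
```

    With a utilization of 800 on a CPU of capacity 1024, the extended value
    is 1000. If the allowed capacity is reduced to 900, the predicted
    frequency is derived from 900 rather than 1000.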
    
    Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> (For the schedutil part)
    Link: https://lore.kernel.org/r/20210614191238.23224-1-lukasz.luba@arm.com
    lukaszluba-arm authored and Peter Zijlstra committed Jun 17, 2021
  3. sched/fair: Take thermal pressure into account while estimating energy

    Energy Aware Scheduling (EAS) needs to be able to predict the frequency
    requests made by the SchedUtil governor to properly estimate energy used
    in the future. It has to take into account CPUs utilization and forecast
    Performance Domain (PD) frequency. There is a corner case when the max
    allowed frequency might be reduced due to thermal. SchedUtil is aware of
    that reduced frequency, so it should be taken into account also in EAS
    estimations.
    
    SchedUtil, as a CPUFreq governor, knows the maximum allowed frequency of
    a CPU, thanks to cpufreq_driver_resolve_freq() and internal clamping
    to 'policy::max'. SchedUtil is responsible to respect that upper limit
    while setting the frequency through CPUFreq drivers. This effective
    frequency is stored internally in 'sugov_policy::next_freq' and EAS has
    to predict that value.
    
    In the existing code the raw value of arch_scale_cpu_capacity() is used
    for clamping the returned CPU utilization from effective_cpu_util().
    This patch fixes the issue of a single CPU's utilization being too big
    by clamping it to the allowed CPU capacity. The allowed CPU capacity is
    the CPU capacity reduced by the raw thermal pressure value.
    
    Thanks to knowledge about allowed CPU capacity, we don't get too big value
    for a single CPU utilization, which is then added to the util sum. The
    util sum is used as a source of information for estimating whole PD energy.
    To avoid wrong energy estimation in EAS (due to capped frequency), make
    sure that the calculation of util sum is aware of allowed CPU capacity.
    
    This thermal pressure might be visible in scenarios where the CPUs are not
    heavily loaded, but some other component (like GPU) drastically reduced
    available power budget and increased the SoC temperature. Thus, we still
    use EAS for task placement and CPUs are not over-utilized.
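
    The clamping described above can be sketched as follows. This is a
    simplified user-space illustration under stated assumptions: the
    helper names allowed_cpu_capacity() and pd_util_sum() are hypothetical,
    and a single capacity and thermal pressure value is assumed for every
    CPU in the Performance Domain.

```c
#include <stddef.h>

/* Allowed capacity: the CPU capacity reduced by the raw thermal pressure. */
static unsigned long allowed_cpu_capacity(unsigned long cpu_cap,
					  unsigned long thermal_pressure)
{
	/* Guard against underflow if the pressure exceeds the capacity. */
	return thermal_pressure < cpu_cap ? cpu_cap - thermal_pressure : 0;
}

/*
 * Sum the utilization of all CPUs in a Performance Domain, clamping each
 * CPU's contribution to the allowed capacity instead of the raw capacity.
 */
static unsigned long pd_util_sum(const unsigned long *cpu_util, size_t nr_cpus,
				 unsigned long cpu_cap,
				 unsigned long thermal_pressure)
{
	unsigned long allowed = allowed_cpu_capacity(cpu_cap, thermal_pressure);
	unsigned long sum = 0;

	for (size_t i = 0; i < nr_cpus; i++)
		sum += cpu_util[i] < allowed ? cpu_util[i] : allowed;

	return sum;
}
```

    For two CPUs with utilization 900 and 300, capacity 1024 and thermal
    pressure 224, the allowed capacity is 800, so the first CPU contributes
    800 instead of 900 to the sum.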
    
    Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Link: https://lore.kernel.org/r/20210614191128.22735-1-lukasz.luba@arm.com
    lukaszluba-arm authored and Peter Zijlstra committed Jun 17, 2021
  4. thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure

    The thermal pressure signal gives information to the scheduler about
    reduced CPU capacity due to thermal. It is based on a value stored in
    a per-cpu 'thermal_pressure' variable. The online CPUs will get the
    new value there, while the offline won't. Unfortunately, when the CPU
    is back online, the value read from per-cpu variable might be wrong
    (stale data).  This might affect the scheduler decisions, since it
    sees the CPU capacity differently than what is actually available.
    
    Fix it by making sure that all online+offline CPUs would get the
    proper value in their per-cpu variable when thermal framework sets
    capping.
    
    Fixes: f12e4f6 ("thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping")
    Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
    Link: https://lore.kernel.org/r/20210614191030.22241-1-lukasz.luba@arm.com
    lukaszluba-arm authored and Peter Zijlstra committed Jun 17, 2021
  5. sched/fair: Return early from update_tg_cfs_load() if delta == 0

    In case the _avg delta is 0 there is no need to update se's _avg
    (level n) nor cfs_rq's _avg (level n-1). These values stay the same.
    
    Since cfs_rq's _avg isn't changed, i.e. no load is propagated down,
    cfs_rq's _sum should stay the same as well.
    
    So bail out after se's _sum has been updated.
    
    Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20210601083616.804229-1-dietmar.eggemann@arm.com
    deggeman authored and Peter Zijlstra committed Jun 17, 2021
  6. sched/pelt: Check that *_avg are null when *_sum are

    Check that we never break the rule that pelt's avg values are null if
    pelt's sum are.
    
    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Acked-by: Odin Ugedal <odin@uged.al>
    Link: https://lore.kernel.org/r/20210601155328.19487-1-vincent.guittot@linaro.org
    vingu-linaro authored and Peter Zijlstra committed Jun 17, 2021

Commits on Jun 14, 2021

  1. sched/fair: Correctly insert cfs_rq's to list on unthrottle

    Fix an issue where fairness is decreased since cfs_rq's can end up not
    being decayed properly. For two sibling control groups with the same
    priority, this can often lead to a load ratio of 99/1 (!!).
    
    This happens because when a cfs_rq is throttled, all the descendant
    cfs_rq's will be removed from the leaf list. When the initial cfs_rq
    is unthrottled, it will currently only re-add descendant cfs_rq's if
    they have one or more entities enqueued. This is not a perfect
    heuristic.
    
    Instead, we insert all cfs_rq's that contain one or more enqueued
    entities, or whose load is not completely decayed.
    
    The bug can often lead to situations like this for equally weighted
    control groups:
    
      $ ps u -C stress
      USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
      root       10009 88.8  0.0   3676   100 pts/1    R+   11:04   0:13 stress --cpu 1
      root       10023  3.0  0.0   3676   104 pts/1    R+   11:04   0:00 stress --cpu 1
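
    The difference between the old and the new re-insertion rule can be
    sketched as below. This is a minimal illustration only: struct
    cfs_rq_sketch and both helper names are hypothetical, and a single
    'load_sum' field stands in for whatever signals the kernel actually
    uses to decide that a cfs_rq is fully decayed.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for a cfs_rq. */
struct cfs_rq_sketch {
	unsigned int nr_running;	/* entities currently enqueued */
	unsigned long load_sum;		/* remaining, not yet decayed load */
};

/* Old rule: re-add on unthrottle only if something is enqueued. */
static bool readd_old_rule(const struct cfs_rq_sketch *cfs_rq)
{
	return cfs_rq->nr_running >= 1;
}

/* New rule: also re-add if the load has not completely decayed. */
static bool readd_new_rule(const struct cfs_rq_sketch *cfs_rq)
{
	return cfs_rq->nr_running >= 1 || cfs_rq->load_sum != 0;
}
```

    A cfs_rq with nothing enqueued but leftover load is skipped by the old
    rule, so its load never decays, while the new rule puts it back on the
    leaf list.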
    
    Fixes: 31bc6ae ("sched/fair: Optimize update_blocked_averages()")
    [vingo: !SMP build fix]
    Signed-off-by: Odin Ugedal <odin@uged.al>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20210612112815.61678-1-odin@uged.al
    odinuge authored and Peter Zijlstra committed Jun 14, 2021

Commits on Jun 13, 2021

  1. Linux 5.13-rc6

    torvalds committed Jun 13, 2021
  2. Merge tag 'perf-tools-fixes-for-v5.13-2021-06-13' of git://git.kernel…

    ….org/pub/scm/linux/kernel/git/acme/linux
    
    Pull perf tools fixes from Arnaldo Carvalho de Melo:
    
     - Correct buffer copying when peeking events
    
     - Sync cpufeatures/disabled-features.h header with the kernel sources
    
    * tag 'perf-tools-fixes-for-v5.13-2021-06-13' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
      tools headers cpufeatures: Sync with the kernel sources
      perf session: Correct buffer copying when peeking events
    torvalds committed Jun 13, 2021
  3. Merge tag 'nfs-for-5.13-3' of git://git.linux-nfs.org/projects/trondm…

    …y/linux-nfs
    
    Pull NFS client bugfixes from Trond Myklebust:
     "Highlights include:
    
      Stable fixes:
    
       - Fix use-after-free in nfs4_init_client()
    
      Bugfixes:
    
       - Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode()
    
       - Fix second deadlock in nfs4_evict_inode()
    
       - nfs4_proc_set_acl should not change the value of NFS_CAP_UIDGID_NOMAP
    
       - Fix setting of the NFS_CAP_SECURITY_LABEL capability"
    
    * tag 'nfs-for-5.13-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
      NFSv4: Fix second deadlock in nfs4_evict_inode()
      NFSv4: Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode()
      NFS: FMODE_READ and friends are C macros, not enum types
      NFS: Fix a potential NULL dereference in nfs_get_client()
      NFS: Fix use-after-free in nfs4_init_client()
      NFS: Ensure the NFS_CAP_SECURITY_LABEL capability is set when appropriate
      NFSv4: nfs4_proc_set_acl needs to restore NFS_CAP_UIDGID_NOMAP on error.
    torvalds committed Jun 13, 2021
  4. Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/g…

    …it/jejb/scsi
    
    Pull SCSI fixes from James Bottomley:
     "Four reasonably small fixes to the core for scsi host allocation
      failure paths.
    
      The root problem is that we're not freeing the memory allocated by
      dev_set_name(), which involves a rejig of many of the free on error
      paths to do put_device() instead of kfree which, in turn, has several
      other knock on ramifications and inspection turned up a few other
      lurking bugs"
    
    * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
      scsi: core: Only put parent device if host state differs from SHOST_CREATED
      scsi: core: Put .shost_dev in failure path if host state changes to RUNNING
      scsi: core: Fix failure handling of scsi_add_host_with_dma()
      scsi: core: Fix error handling of scsi_host_alloc()
    torvalds committed Jun 13, 2021

Commits on Jun 12, 2021

  1. Merge tag 'riscv-for-linus-5.13-rc6' of git://git.kernel.org/pub/scm/…

    …linux/kernel/git/riscv/linux
    
    Pull RISC-V fixes from Palmer Dabbelt:
    
     - A pair of XIP fixes: one to fix alternatives, and one to turn off the
       rest of the features that require code modification
    
     - A fix to a type that was causing some alternatives to break
    
     - A build fix for BUILTIN_DTB
    
    * tag 'riscv-for-linus-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
      riscv: Fix BUILTIN_DTB for sifive and microchip soc
      riscv: alternative: fix typo in macro name
      riscv: code patching only works on !XIP_KERNEL
      riscv: xip: support runtime trap patching
    torvalds committed Jun 12, 2021
  2. mm: relocate 'write_protect_seq' in struct mm_struct

    0day robot reported a 9.2% regression for will-it-scale mmap1 test
    case[1], caused by commit 57efa1f ("mm/gup: prevent gup_fast from
    racing with COW during fork").
    
    Further debugging shows the regression is because that commit changes
    the offset of the hot field 'mmap_lock' inside structure 'mm_struct',
    and thus its cache alignment.
    
    From the perf data, the contention for 'mmap_lock' is very severe and
    takes around 95% cpu cycles, and it is a rw_semaphore
    
            struct rw_semaphore {
                    atomic_long_t count;	/* 8 bytes */
                    atomic_long_t owner;	/* 8 bytes */
                    struct optimistic_spin_queue osq; /* spinner MCS lock */
                    ...
    
    Before commit 57efa1f adds the 'write_protect_seq', it happens to
    have a very optimal cache alignment layout, as Linus explained:
    
     "and before the addition of the 'write_protect_seq' field, the
      mmap_sem was at offset 120 in 'struct mm_struct'.
    
      Which meant that count and owner were in two different cachelines,
      and then when you have contention and spend time in
      rwsem_down_write_slowpath(), this is probably *exactly* the kind
      of layout you want.
    
      Because first the rwsem_write_trylock() will do a cmpxchg on the
      first cacheline (for the optimistic fast-path), and then in the
      case of contention, rwsem_down_write_slowpath() will just access
      the second cacheline.
    
      Which is probably just optimal for a load that spends a lot of
      time contended - new waiters touch that first cacheline, and then
      they queue themselves up on the second cacheline."
    
    After the commit, the rw_semaphore is at offset 128, which means the
    'count' and 'owner' fields are now in the same cacheline, and causes
    more cache bouncing.
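
    The effect of the two offsets can be checked with a small sketch. It
    assumes 64-byte cachelines and uses a hypothetical 'struct rwsem_sketch'
    that keeps only the first two 8-byte fields of the rw_semaphore shown
    earlier. The helper names are also made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_SIZE 64	/* assumption: 64-byte cachelines */

/* Hypothetical stand-in for the first two fields of rw_semaphore. */
struct rwsem_sketch {
	int64_t count;	/* 8 bytes */
	int64_t owner;	/* 8 bytes */
};

/* Index of the cacheline that contains the given byte offset. */
static size_t cacheline_of(size_t offset)
{
	return offset / CACHELINE_SIZE;
}

/*
 * Return 1 if 'count' and 'owner' land in different cachelines when the
 * semaphore starts at 'rwsem_offset' within the enclosing structure.
 */
static int count_owner_split(size_t rwsem_offset)
{
	size_t count_off = rwsem_offset + offsetof(struct rwsem_sketch, count);
	size_t owner_off = rwsem_offset + offsetof(struct rwsem_sketch, owner);

	return cacheline_of(count_off) != cacheline_of(owner_off);
}
```

    At offset 120, 'count' occupies bytes 120-127 and 'owner' starts at
    byte 128, so they sit in different cachelines. At offset 128 both
    fields fall in the same cacheline.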
    
    Currently there are 3 "#ifdef CONFIG_XXX" before 'mmap_lock' which will
    affect its offset:
    
      CONFIG_MMU
      CONFIG_MEMBARRIER
      CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
    
    The layout above is on a 64-bit system with 0day's default kernel config
    (similar to RHEL-8.3's config), in which all these 3 options are 'y'.
    And the layout can vary with different kernel configs.
    
    Relayouting a structure is usually a double-edged sword, as sometimes it
    can help one case but hurt other cases.  For this case, one solution
    is, as the newly added 'write_protect_seq' is a 4 bytes long seqcount_t
    (when CONFIG_DEBUG_LOCK_ALLOC=n), placing it into an existing 4 bytes
    hole in 'mm_struct' will not change other fields' alignment, while
    fixing the regression.
    
    Link: https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/ [1]
    Reported-by: kernel test robot <oliver.sang@intel.com>
    Signed-off-by: Feng Tang <feng.tang@intel.com>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    ftang1 authored and torvalds committed Jun 12, 2021