Zhen-Ni/sched-…

Commits on Feb 15, 2022

  1. sched: Move energy_aware sysctls to topology.c

    Move energy_aware sysctls to topology.c and use the new
    register_sysctl_init() to register the sysctl interface.
    
    Signed-off-by: Zhen Ni <nizhen@uniontech.com>
    nizhenth authored and intel-lab-lkp committed Feb 15, 2022
  2. sched: Move cfs_bandwidth_slice sysctls to fair.c

    Move cfs_bandwidth_slice sysctls to fair.c and use the
    new register_sysctl_init() to register the sysctl interface.
    
    Signed-off-by: Zhen Ni <nizhen@uniontech.com>
    nizhenth authored and intel-lab-lkp committed Feb 15, 2022
  3. sched: Move uclamp_util sysctls to core.c

    Move uclamp_util sysctls to core.c and use the new
    register_sysctl_init() to register the sysctl interface.
    
    Signed-off-by: Zhen Ni <nizhen@uniontech.com>
    nizhenth authored and intel-lab-lkp committed Feb 15, 2022
  4. sched: Move rr_timeslice sysctls to rt.c

    Move rr_timeslice sysctls to rt.c and use the new
    register_sysctl_init() to register the sysctl interface.
    
    Signed-off-by: Zhen Ni <nizhen@uniontech.com>
    nizhenth authored and intel-lab-lkp committed Feb 15, 2022
  5. sched: Move deadline_period sysctls to deadline.c

    Move deadline_period sysctls to deadline.c and use the new
    register_sysctl_init() to register the sysctl interface.
    
    Signed-off-by: Zhen Ni <nizhen@uniontech.com>
    nizhenth authored and intel-lab-lkp committed Feb 15, 2022
  6. sched: Move rt_period/runtime sysctls to rt.c

    Move rt_period/runtime sysctls to rt.c and use the new
    register_sysctl_init() to register the sysctl interface.
    
    Signed-off-by: Zhen Ni <nizhen@uniontech.com>
    nizhenth authored and intel-lab-lkp committed Feb 15, 2022
  7. sched: Move schedstats sysctls to core.c

    Move schedstats sysctls to core.c and use the new
    register_sysctl_init() to register the sysctl interface.
    
    Signed-off-by: Zhen Ni <nizhen@uniontech.com>
    nizhenth authored and intel-lab-lkp committed Feb 15, 2022
  8. sched: Move child_runs_first sysctls to fair.c

    Move child_runs_first sysctls to fair.c and use the new
    register_sysctl_init() to register the sysctl interface.
    
    Signed-off-by: Zhen Ni <nizhen@uniontech.com>
    nizhenth authored and intel-lab-lkp committed Feb 15, 2022

Commits on Feb 11, 2022

  1. sched/numa-balancing: Move some document to make it consistent with the code
    
    After commit 8a99b68 ("sched: Move SCHED_DEBUG sysctl to
    debugfs"), some NUMA balancing sysctls enclosed with SCHED_DEBUG have
    been moved to debugfs.  This patch moves the documentation for these
    sysctls from
    
      Documentation/admin-guide/sysctl/kernel.rst
    
    to
    
      Documentation/scheduler/sched-debug.rst
    
    to make the documentation consistent with the code.
    
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Link: https://lkml.kernel.org/r/20220210052514.3038279-1-ying.huang@intel.com
    yhuang-intel authored and Peter Zijlstra committed Feb 11, 2022
  2. sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
    
    Commit 7d2b5dd ("sched/numa: Allow a floating imbalance between NUMA
    nodes") allowed an imbalance between NUMA nodes such that communicating
    tasks would not be pulled apart by the load balancer. This works fine when
    there is a 1:1 relationship between LLC and node but can be suboptimal
    for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
    
    Zen* has multiple LLCs per node with local memory channels and due to
    the allowed imbalance, it's far harder to tune some workloads to run
    optimally than it is on hardware that has 1 LLC per node. This patch
    allows an imbalance to exist up to the point where LLCs should be balanced
    between nodes.
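    
    The rule described above can be pictured with a small userspace model.
    This is only a sketch under stated assumptions: allow_numa_imbalance is
    the helper named in the neighbouring commit, while imb_numa_nr (a
    per-domain limit derived from the LLC layout), adjust_numa_imbalance and
    the threshold value of 2 are hypothetical names and values chosen for
    illustration, not taken from the patch text.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed cut-off below which a small imbalance is simply ignored. */
#define NUMA_IMBALANCE_MIN 2

/*
 * An imbalance is tolerated only while the destination has no more running
 * tasks than the per-domain limit. imb_numa_nr is a hypothetical name for
 * that limit.
 */
static bool allow_numa_imbalance(int running, int imb_numa_nr)
{
	return running <= imb_numa_nr;
}

/* Return the imbalance the load balancer should still act on. */
static long adjust_numa_imbalance(long imbalance, int dst_running, int imb_numa_nr)
{
	assert(imbalance >= 0);

	if (!allow_numa_imbalance(dst_running, imb_numa_nr))
		return imbalance;	/* past the limit: balance as usual */
	if (imbalance <= NUMA_IMBALANCE_MIN)
		return 0;		/* small imbalance: leave tasks in place */
	return imbalance;
}
```

    With a limit of 4, a destination running 3 tasks keeps an imbalance of 2
    untouched (the function returns 0), whereas a destination running 5
    tasks reports the imbalance unchanged so that it gets corrected.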
    
    On a Zen3 machine running STREAM parallelised with OMP to have one instance
    per LLC and without binding, the results are
    
                                5.17.0-rc0             5.17.0-rc0
                                   vanilla       sched-numaimb-v6
    MB/sec copy-16    162596.94 (   0.00%)   580559.74 ( 257.05%)
    MB/sec scale-16   136901.28 (   0.00%)   374450.52 ( 173.52%)
    MB/sec add-16     157300.70 (   0.00%)   564113.76 ( 258.62%)
    MB/sec triad-16   151446.88 (   0.00%)   564304.24 ( 272.61%)
    
    STREAM can use directives to force the spread if the OpenMP is new
    enough but that doesn't help if an application uses threads and
    it's not known in advance how many threads will be created.
    
    Coremark is a CPU and cache intensive benchmark parallelised with
    threads. When running with 1 thread per core, the vanilla kernel
    allows threads to contend on cache. With the patch:
    
                                   5.17.0-rc0             5.17.0-rc0
                                      vanilla       sched-numaimb-v5
    Min       Score-16   368239.36 (   0.00%)   389816.06 (   5.86%)
    Hmean     Score-16   388607.33 (   0.00%)   427877.08 *  10.11%*
    Max       Score-16   408945.69 (   0.00%)   481022.17 (  17.62%)
    Stddev    Score-16    15247.04 (   0.00%)    24966.82 ( -63.75%)
    CoeffVar  Score-16        3.92 (   0.00%)        5.82 ( -48.48%)
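    
    The percentages in parentheses appear consistent with a plain relative
    change against the vanilla column, with the sign flipped for metrics
    where lower is better (Stddev, CoeffVar, run time). The sketch below
    reproduces that arithmetic; the function names are hypothetical and the
    sign convention is inferred from the figures, not stated in the commit.

```c
#include <assert.h>

/* Relative change in percent for metrics where higher is better (MB/sec, scores). */
static double gain_higher_is_better(double baseline, double patched)
{
	assert(baseline != 0.0);
	return (patched - baseline) / baseline * 100.0;
}

/* Same change for metrics where lower is better: an increase counts as a loss. */
static double gain_lower_is_better(double baseline, double patched)
{
	assert(baseline != 0.0);
	return (baseline - patched) / baseline * 100.0;
}
```

    Feeding in the STREAM copy-16 pair (162596.94, 580559.74) gives roughly
    257.05, and the Coremark Stddev pair (15247.04, 24966.82) gives roughly
    -63.75, matching the tables.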
    
    It can also make a big difference for semi-realistic workloads
    like specjbb which can execute arbitrary numbers of threads without
    advance knowledge of how they should be placed. Even in cases where
    the average performance is neutral, the results are more stable.
    
                                   5.17.0-rc0             5.17.0-rc0
                                      vanilla       sched-numaimb-v6
    Hmean     tput-1      71631.55 (   0.00%)    73065.57 (   2.00%)
    Hmean     tput-8     582758.78 (   0.00%)   556777.23 (  -4.46%)
    Hmean     tput-16   1020372.75 (   0.00%)  1009995.26 (  -1.02%)
    Hmean     tput-24   1416430.67 (   0.00%)  1398700.11 (  -1.25%)
    Hmean     tput-32   1687702.72 (   0.00%)  1671357.04 (  -0.97%)
    Hmean     tput-40   1798094.90 (   0.00%)  2015616.46 *  12.10%*
    Hmean     tput-48   1972731.77 (   0.00%)  2333233.72 (  18.27%)
    Hmean     tput-56   2386872.38 (   0.00%)  2759483.38 (  15.61%)
    Hmean     tput-64   2909475.33 (   0.00%)  2925074.69 (   0.54%)
    Hmean     tput-72   2585071.36 (   0.00%)  2962443.97 (  14.60%)
    Hmean     tput-80   2994387.24 (   0.00%)  3015980.59 (   0.72%)
    Hmean     tput-88   3061408.57 (   0.00%)  3010296.16 (  -1.67%)
    Hmean     tput-96   3052394.82 (   0.00%)  2784743.41 (  -8.77%)
    Hmean     tput-104  2997814.76 (   0.00%)  2758184.50 (  -7.99%)
    Hmean     tput-112  2955353.29 (   0.00%)  2859705.09 (  -3.24%)
    Hmean     tput-120  2889770.71 (   0.00%)  2764478.46 (  -4.34%)
    Hmean     tput-128  2871713.84 (   0.00%)  2750136.73 (  -4.23%)
    Stddev    tput-1       5325.93 (   0.00%)     2002.53 (  62.40%)
    Stddev    tput-8       6630.54 (   0.00%)    10905.00 ( -64.47%)
    Stddev    tput-16     25608.58 (   0.00%)     6851.16 (  73.25%)
    Stddev    tput-24     12117.69 (   0.00%)     4227.79 (  65.11%)
    Stddev    tput-32     27577.16 (   0.00%)     8761.05 (  68.23%)
    Stddev    tput-40     59505.86 (   0.00%)     2048.49 (  96.56%)
    Stddev    tput-48    168330.30 (   0.00%)    93058.08 (  44.72%)
    Stddev    tput-56    219540.39 (   0.00%)    30687.02 (  86.02%)
    Stddev    tput-64    121750.35 (   0.00%)     9617.36 (  92.10%)
    Stddev    tput-72    223387.05 (   0.00%)    34081.13 (  84.74%)
    Stddev    tput-80    128198.46 (   0.00%)    22565.19 (  82.40%)
    Stddev    tput-88    136665.36 (   0.00%)    27905.97 (  79.58%)
    Stddev    tput-96    111925.81 (   0.00%)    99615.79 (  11.00%)
    Stddev    tput-104   146455.96 (   0.00%)    28861.98 (  80.29%)
    Stddev    tput-112    88740.49 (   0.00%)    58288.23 (  34.32%)
    Stddev    tput-120   186384.86 (   0.00%)    45812.03 (  75.42%)
    Stddev    tput-128    78761.09 (   0.00%)    57418.48 (  27.10%)
    
    Similarly, for embarrassingly parallel problems like NPB-ep, there are
    improvements due to better spreading across LLC when the machine is not
    fully utilised.
    
                                  vanilla       sched-numaimb-v6
    Min       ep.D       31.79 (   0.00%)       26.11 (  17.87%)
    Amean     ep.D       31.86 (   0.00%)       26.17 *  17.86%*
    Stddev    ep.D        0.07 (   0.00%)        0.05 (  24.41%)
    CoeffVar  ep.D        0.22 (   0.00%)        0.20 (   7.97%)
    Max       ep.D       31.93 (   0.00%)       26.21 (  17.91%)
    
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
    Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
    Link: https://lore.kernel.org/r/20220208094334.16379-3-mgorman@techsingularity.net
    gormanm authored and Peter Zijlstra committed Feb 11, 2022
  3. sched/fair: Improve consistency of allowed NUMA balance calculations

    There are inconsistencies when determining if a NUMA imbalance is allowed
    that should be corrected.
    
    o allow_numa_imbalance changes types and is not always examining
      the destination group so both the type should be corrected as
      well as the naming.
    o find_idlest_group uses the sched_domain's weight instead of the
      group weight which is different to find_busiest_group
    o find_busiest_group uses the source group instead of the destination
      which is different to task_numa_find_cpu
    o Both find_idlest_group and find_busiest_group should account
      for the number of running tasks if a move was allowed to be
      consistent with task_numa_find_cpu
    
    Fixes: 7d2b5dd ("sched/numa: Allow a floating imbalance between NUMA nodes")
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
    Link: https://lore.kernel.org/r/20220208094334.16379-2-mgorman@techsingularity.net
    gormanm authored and Peter Zijlstra committed Feb 11, 2022
  4. selftests/rseq: Change type of rseq_offset to ptrdiff_t

    Just before the 2.35 release of glibc, the __rseq_offset userspace ABI
    was changed from int to ptrdiff_t.
    
    Adapt to this change in the kernel selftests.
    
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://sourceware.org/pipermail/libc-alpha/2022-February/136024.html
    compudj authored and Peter Zijlstra committed Feb 11, 2022

Commits on Feb 2, 2022

  1. sched: move autogroup sysctls into its own file

    move autogroup sysctls to autogroup.c and use the new
    register_sysctl_init() to register the sysctl interface.
    
    Signed-off-by: Zhen Ni <nizhen@uniontech.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220128095025.8745-1-nizhen@uniontech.com
    nizhenth authored and Peter Zijlstra committed Feb 2, 2022
  2. selftests/rseq: x86-32: use %gs segment selector for accessing rseq thread area
    
    Rather than use rseq_get_abi() and pass its result through a register to
    the inline assembler, directly access the per-thread rseq area through a
    memory reference combining the %gs segment selector, the constant offset
    of the field in struct rseq, and the rseq_offset value (in a register).
    
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220124171253.22072-16-mathieu.desnoyers@efficios.com
    compudj authored and Peter Zijlstra committed Feb 2, 2022
  3. selftests/rseq: x86-64: use %fs segment selector for accessing rseq thread area
    
    Rather than use rseq_get_abi() and pass its result through a register to
    the inline assembler, directly access the per-thread rseq area through a
    memory reference combining the %fs segment selector, the constant offset
    of the field in struct rseq, and the rseq_offset value (in a register).
    
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220124171253.22072-15-mathieu.desnoyers@efficios.com
    compudj authored and Peter Zijlstra committed Feb 2, 2022
  4. selftests/rseq: Fix: work-around asm goto compiler bugs

    gcc and clang each have their own compiler bugs with respect to asm
    goto. Implement a work-around for compiler versions known to have those
    bugs.
    
    gcc prior to 4.8.2 miscompiles asm goto.
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58670
    
    gcc prior to 8.1.0 miscompiles asm goto at O1.
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103908
    
    clang prior to version 13.0.1 miscompiles asm goto at O2.
    llvm/llvm-project#52735
    
    Work around these issues by adding a volatile inline asm with
    memory clobber in the fallthrough after the asm goto and at each
    label target.  Emit this for all compilers in case other similar
    issues are found in the future.
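    
    As a rough illustration of where the barriers go, the sketch below
    places an empty volatile asm with a memory clobber on the fallthrough
    path and at the label target of an asm goto. The macro and function
    names are hypothetical, and the asm goto body is left empty so the
    example stays architecture independent; it needs a compiler that
    supports asm goto (GCC or clang).

```c
#include <assert.h>
#include <stddef.h>

/* Empty asm with a memory clobber: forces the compiler to reload memory. */
#define after_asm_goto_sketch()	__asm__ __volatile__ ("" : : : "memory")

/* Returns *value on the fallthrough path, -1 if the asm jumps to the label. */
static int asm_goto_demo(int *value)
{
	assert(value != NULL);

	/* The empty template never jumps, so the label is only a target. */
	__asm__ __volatile__ goto ("" : : : "memory" : abort_path);
	after_asm_goto_sketch();	/* barrier in the fallthrough */
	return *value;

abort_path:
	after_asm_goto_sketch();	/* barrier at the label target */
	return -1;
}
```

    Because the barrier is emitted unconditionally, the same source works
    for compilers with and without the bugs listed above.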
    
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220124171253.22072-14-mathieu.desnoyers@efficios.com
    compudj authored and Peter Zijlstra committed Feb 2, 2022
  5. selftests/rseq: Remove arm/mips asm goto compiler work-around

    The arm and mips work-around for asm goto size guess issues are not
    properly documented, and lack reference to specific compiler versions,
    upstream compiler bug tracker entry, and reproducer.
    
    I can only find a loosely documented patch in my original LKML rseq post
    referring to gcc < 7 on ARM, but it does not appear to be sufficient to
    track the exact issue. Also, I am not sure MIPS really has the same
    limitation.
    
    Therefore, remove the work-around until we can properly document this.
    
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/lkml/20171121141900.18471-17-mathieu.desnoyers@efficios.com/
    compudj authored and Peter Zijlstra committed Feb 2, 2022
  6. selftests/rseq: Fix warnings about #if checks of undefined tokens

    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220124171253.22072-12-mathieu.desnoyers@efficios.com
    compudj authored and Peter Zijlstra committed Feb 2, 2022
  7. selftests/rseq: Fix ppc32 offsets by using long rather than off_t

    The semantic of off_t is for file offsets. We mean to use it as an
    offset from a pointer. We really expect it to fit in a single register,
    and not use a 64-bit type on 32-bit architectures.
    
    Fix runtime issues on ppc32 where the offset is always 0 due to
    inconsistency between the argument type (off_t -> 64-bit) and type
    expected by the inline assembler (32-bit).
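    
    One way to picture the mismatch: a 64-bit offset splits into two 32-bit
    words, and for any offset below 4 GiB the high word is zero. If a 32-bit
    operand ends up referring to the high word, the assembler sees 0. The
    helpers below are hypothetical and only model that split; they do not
    reproduce the ppc32 register allocation itself, which is an assumption
    about the root cause.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* High 32 bits of a 64-bit offset: zero for any offset below 4 GiB. */
static uint32_t offset_high_word(int64_t off)
{
	return (uint32_t)((uint64_t)off >> 32);
}

/* Low 32 bits: the part a 32-bit load or store actually needs. */
static uint32_t offset_low_word(int64_t off)
{
	return (uint32_t)off;
}

/* Narrow to a register-sized long, checking that no bits are lost. */
static long offset_to_long(int64_t off)
{
	assert(off >= LONG_MIN && off <= LONG_MAX);
	return (long)off;
}
```

    On Linux, long has the same width as a pointer on both 32-bit and 64-bit
    targets, which is why it fits in a single register where a 64-bit off_t
    may not.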
    
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220124171253.22072-11-mathieu.desnoyers@efficios.com
    compudj authored and Peter Zijlstra committed Feb 2, 2022
  8. selftests/rseq: Fix ppc32 missing instruction selection "u" and "x" for load/store
    
    Building the rseq basic test with
    gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12)
    Target: powerpc-linux-gnu
    
    leads to these errors:
    
    /tmp/ccieEWxU.s: Assembler messages:
    /tmp/ccieEWxU.s:118: Error: syntax error; found `,', expected `('
    /tmp/ccieEWxU.s:118: Error: junk at end of line: `,8'
    /tmp/ccieEWxU.s:121: Error: syntax error; found `,', expected `('
    /tmp/ccieEWxU.s:121: Error: junk at end of line: `,8'
    /tmp/ccieEWxU.s:626: Error: syntax error; found `,', expected `('
    /tmp/ccieEWxU.s:626: Error: junk at end of line: `,8'
    /tmp/ccieEWxU.s:629: Error: syntax error; found `,', expected `('
    /tmp/ccieEWxU.s:629: Error: junk at end of line: `,8'
    /tmp/ccieEWxU.s:735: Error: syntax error; found `,', expected `('
    /tmp/ccieEWxU.s:735: Error: junk at end of line: `,8'
    /tmp/ccieEWxU.s:738: Error: syntax error; found `,', expected `('
    /tmp/ccieEWxU.s:738: Error: junk at end of line: `,8'
    /tmp/ccieEWxU.s:741: Error: syntax error; found `,', expected `('
    /tmp/ccieEWxU.s:741: Error: junk at end of line: `,8'
    Makefile:581: recipe for target 'basic_percpu_ops_test.o' failed
    
    Based on discussion with Linux powerpc maintainers and review of
    the use of the "m" operand in powerpc kernel code, add the missing
    %Un%Xn (where n is operand number) to the lwz, stw, ld, and std
    instructions when used with "m" operands.
    
    Using "WORD" to mean either a 32-bit or 64-bit type depending on
    the architecture is misleading. The term "WORD" really means a
    32-bit type in both 32-bit and 64-bit powerpc assembler. The intent
    here is to wrap load/store to intptr_t into common macros for both
    32-bit and 64-bit.
    
    Rename the macros with a RSEQ_ prefix, and use the terms "INT"
    for always 32-bit type, and "LONG" for architecture bitness-sized
    type.
    
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220124171253.22072-10-mathieu.desnoyers@efficios.com
    compudj authored and Peter Zijlstra committed Feb 2, 2022
  9. selftests/rseq: Fix ppc32: wrong rseq_cs 32-bit field pointer on big endian
    
    ppc32 incorrectly uses padding as rseq_cs pointer field. Fix this by
    using the rseq_cs.arch.ptr field.
    
    Use this field across all architectures.
    
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220124171253.22072-9-mathieu.desnoyers@efficios.com
    compudj authored and Peter Zijlstra committed Feb 2, 2022
  10. selftests/rseq: Uplift rseq selftests for compatibility with glibc-2.35

    glibc-2.35 (upcoming release date 2022-02-01) exposes the rseq per-thread
    data in the TCB, accessible at an offset from the thread pointer, rather
    than through an actual Thread-Local Storage (TLS) variable, as the
    Linux kernel selftests initially expected.
    
    The __rseq_abi TLS and glibc-2.35's ABI for per-thread data cannot
    actively coexist in a process, because the kernel supports only a single
    rseq registration per thread.
    
    Here is the scheme introduced to ensure selftests can work both with an
    older glibc and with glibc-2.35+:
    
    - librseq exposes its own "rseq_offset, rseq_size, rseq_flags" ABI.
    
    - librseq queries for glibc rseq ABI (__rseq_offset, __rseq_size,
      __rseq_flags) using dlsym() in a librseq library constructor. If those
      are found, copy their values into rseq_offset, rseq_size, and
      rseq_flags.
    
    - Else, if those glibc symbols are not found, handle rseq registration
      from librseq and use its own IE-model TLS to implement the rseq ABI
      per-thread storage.
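    
    The probing step of this scheme could look roughly like the sketch
    below. It is not the librseq implementation: the struct and function
    names are hypothetical, the lookup goes through a dlopen(NULL) handle as
    one portable way to search the already loaded libraries, and the symbol
    types (ptrdiff_t for the offset, unsigned int for size and flags) are
    assumptions based on the glibc-2.35 ABI. On older glibc the program may
    need to be linked with -ldl.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical container for the values librseq would expose. */
struct rseq_abi_info {
	ptrdiff_t offset;	/* rseq area offset from the thread pointer */
	unsigned int size;
	unsigned int flags;
	int from_libc;		/* 1 if the values came from glibc symbols */
};

/* Returns 1 and fills *out if all three glibc symbols are found, else 0. */
static int probe_libc_rseq(struct rseq_abi_info *out)
{
	void *self;
	const ptrdiff_t *off;
	const unsigned int *size, *flags;

	assert(out != NULL);
	out->offset = 0;
	out->size = 0;
	out->flags = 0;
	out->from_libc = 0;

	/* Handle for the main program and the libraries loaded at startup. */
	self = dlopen(NULL, RTLD_NOW);
	if (!self)
		return 0;

	off = dlsym(self, "__rseq_offset");
	size = dlsym(self, "__rseq_size");
	flags = dlsym(self, "__rseq_flags");
	if (off && size && flags) {
		out->offset = *off;
		out->size = *size;
		out->flags = *flags;
		out->from_libc = 1;
	}
	dlclose(self);
	return out->from_libc;
}
```

    A return value of 0 corresponds to the fallback branch of the scheme,
    where the library would register rseq itself and use its own TLS area.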
    
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220124171253.22072-8-mathieu.desnoyers@efficios.com
    compudj authored and Peter Zijlstra committed Feb 2, 2022
  11. selftests/rseq: Introduce thread pointer getters

    This is done in preparation for the selftest uplift to become compatible
    with glibc-2.35.
    
    glibc-2.35 exposes the rseq per-thread data in the TCB, accessible
    at an offset from the thread pointer.
    
    The toolchains do not implement accessing the thread pointer on all
    architectures. Provide thread pointer getters for ppc and x86 which
    lack (or lacked until recently) toolchain support.
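    
    A thread pointer getter along these lines might look like the sketch
    below. The function names are hypothetical; the x86 branches assume the
    Linux convention that the first word of the %fs (64-bit) or %gs (32-bit)
    segment holds the thread pointer, and other architectures are assumed to
    have a working __builtin_thread_pointer().

```c
#include <assert.h>
#include <stddef.h>

/* Read the thread pointer; x86 goes through the segment register. */
static inline void *thread_pointer_sketch(void)
{
#if defined(__x86_64__)
	void *tp;

	__asm__ ("mov %%fs:0, %0" : "=r" (tp));
	return tp;
#elif defined(__i386__)
	void *tp;

	__asm__ ("mov %%gs:0, %0" : "=r" (tp));
	return tp;
#else
	return __builtin_thread_pointer();
#endif
}

/* Locate per-thread data stored at a fixed offset from the thread pointer. */
static inline void *area_at_offset(void *tp, ptrdiff_t offset)
{
	assert(tp != NULL);
	return (char *)tp + offset;
}
```

    Combining the two, area_at_offset(thread_pointer_sketch(), offset) is
    how per-thread data placed in the TCB would be reached once the offset
    is known.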
    
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220124171253.22072-7-mathieu.desnoyers@efficios.com
    compudj authored and Peter Zijlstra committed Feb 2, 2022
  12. selftests/rseq: Introduce rseq_get_abi() helper

    This is done in preparation for the selftest uplift to become compatible
    with glibc-2.35.
    
    glibc-2.35 exposes the rseq per-thread data in the TCB, accessible
    at an offset from the thread pointer, rather than through an actual
    Thread-Local Storage (TLS) variable, as the kernel selftests initially
    expected.
    
    Introduce a rseq_get_abi() helper, initially using the __rseq_abi
    TLS variable, in preparation for changing this userspace ABI for one
    which is compatible with glibc-2.35.
    
    Note that the __rseq_abi TLS and glibc-2.35's ABI for per-thread data
    cannot actively coexist in a process, because the kernel supports only
    a single rseq registration per thread.
    
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220124171253.22072-6-mathieu.desnoyers@efficios.com
    compudj authored and Peter Zijlstra committed Feb 2, 2022
  13. selftests/rseq: Remove volatile from __rseq_abi

    This is done in preparation for the selftest uplift to become compatible
    with glibc-2.35.
    
    All accesses to the __rseq_abi fields are volatile, but remove the
    volatile from the TLS variable declaration, otherwise we are stuck with
    volatile for the upcoming rseq_get_abi() helper.
    
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220124171253.22072-5-mathieu.desnoyers@efficios.com
    compudj authored and Peter Zijlstra committed Feb 2, 2022
  14. selftests/rseq: Remove useless assignment to cpu variable

    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220124171253.22072-4-mathieu.desnoyers@efficios.com
    compudj authored and Peter Zijlstra committed Feb 2, 2022
  15. rseq: Remove broken uapi field layout on 32-bit little endian

    The rseq rseq_cs.ptr.{ptr32,padding} uapi endianness handling is
    entirely wrong on 32-bit little endian: a preprocessor logic mistake
    wrongly uses the big endian field layout on 32-bit little endian
    architectures.
    
    Fortunately, those ptr32 accessors were never used within the kernel,
    and only meant as a convenience for user-space.
    
    Remove those and replace the whole rseq_cs union by a __u64 type, as
    this is the only thing really needed to express the ABI. Document how
    32-bit architectures are meant to interact with this field.
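    
    A minimal sketch of how user-space might treat the field once it is a
    plain 64-bit integer: pointers are zero-extended on store, so a 32-bit
    architecture only ever changes the low-order bits, and the position of
    those bits in memory depends on endianness. The struct and helper names
    are hypothetical stand-ins, not the uapi definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the 64-bit rseq_cs field. */
struct rseq_cs_field {
	uint64_t rseq_cs;
};

/* Zero-extend the pointer: on 32-bit the high-order bits stay 0. */
static void store_rseq_cs(struct rseq_cs_field *f, const void *cs)
{
	assert(f != NULL);
	f->rseq_cs = (uint64_t)(uintptr_t)cs;
}

static const void *load_rseq_cs(const struct rseq_cs_field *f)
{
	assert(f != NULL);
	return (const void *)(uintptr_t)f->rseq_cs;
}

/* Byte offset of the low-order 32 bits inside the 64-bit field. */
static size_t low32_byte_offset(void)
{
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
	return 4;
#else
	return 0;
#endif
}

/* Read the low-order 32 bits the way a 32-bit load would see them. */
static uint32_t read_low32(const struct rseq_cs_field *f)
{
	uint32_t lo;

	assert(f != NULL);
	memcpy(&lo, (const char *)&f->rseq_cs + low32_byte_offset(), sizeof(lo));
	return lo;
}
```

    Getting low32_byte_offset() wrong for one endianness is the kind of
    mistake the removed union encoded, which is why a single 64-bit field is
    the simpler contract.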
    
    Fixes: ec9c82e ("rseq: uapi: Declare rseq_cs field as union, update includes")
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220127152720.25898-1-mathieu.desnoyers@efficios.com
    compudj authored and Peter Zijlstra committed Feb 2, 2022
  16. selftests/rseq: introduce own copy of rseq uapi header

    The Linux kernel rseq uapi header has a broken layout for the
    rseq_cs.ptr field on 32-bit little endian architectures. The entire
    rseq_cs.ptr field is planned for removal, leaving only the 64-bit
    rseq_cs.ptr64 field available.
    
    Both glibc and librseq use their own copy of the Linux kernel uapi
    header, where they introduce proper union fields to access the 32-bit
    low-order bits of the rseq_cs pointer on 32-bit architectures.
    
    Introduce a copy of the Linux kernel uapi headers in the Linux kernel
    selftests.
    
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220124171253.22072-2-mathieu.desnoyers@efficios.com
    compudj authored and Peter Zijlstra committed Feb 2, 2022

Commits on Jan 27, 2022

  1. psi: Fix "no previous prototype" warnings when CONFIG_CGROUPS=n

    When CONFIG_CGROUPS is disabled, psi code generates the following warnings:
    
    kernel/sched/psi.c:1112:21: warning: no previous prototype for 'psi_trigger_create' [-Wmissing-prototypes]
        1112 | struct psi_trigger *psi_trigger_create(struct psi_group *group,
             |                     ^~~~~~~~~~~~~~~~~~
    kernel/sched/psi.c:1182:6: warning: no previous prototype for 'psi_trigger_destroy' [-Wmissing-prototypes]
        1182 | void psi_trigger_destroy(struct psi_trigger *t)
             |      ^~~~~~~~~~~~~~~~~~~
    kernel/sched/psi.c:1249:10: warning: no previous prototype for 'psi_trigger_poll' [-Wmissing-prototypes]
        1249 | __poll_t psi_trigger_poll(void **trigger_ptr,
             |          ^~~~~~~~~~~~~~~~
    
    Change declarations of these functions in the header to provide the
    prototypes even when they are unused.
    
    Fixes: 0e94682 ("psi: introduce psi monitor")
    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220119223940.787748-2-surenb@google.com
    surenbaghdasaryan authored and Peter Zijlstra committed Jan 27, 2022
  2. psi: Fix "defined but not used" warnings when CONFIG_PROC_FS=n

    When CONFIG_PROC_FS is disabled, psi code generates the following warnings:
    
    kernel/sched/psi.c:1364:30: warning: 'psi_cpu_proc_ops' defined but not used [-Wunused-const-variable=]
        1364 | static const struct proc_ops psi_cpu_proc_ops = {
             |                              ^~~~~~~~~~~~~~~~
    kernel/sched/psi.c:1355:30: warning: 'psi_memory_proc_ops' defined but not used [-Wunused-const-variable=]
        1355 | static const struct proc_ops psi_memory_proc_ops = {
             |                              ^~~~~~~~~~~~~~~~~~~
    kernel/sched/psi.c:1346:30: warning: 'psi_io_proc_ops' defined but not used [-Wunused-const-variable=]
        1346 | static const struct proc_ops psi_io_proc_ops = {
             |                              ^~~~~~~~~~~~~~~
    
    Make definitions of these structures and related functions conditional on
    CONFIG_PROC_FS config.
    
    Fixes: 0e94682 ("psi: introduce psi monitor")
    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220119223940.787748-3-surenb@google.com
    surenbaghdasaryan authored and Peter Zijlstra committed Jan 27, 2022
  3. sched/uclamp: Fix iowait boost escaping uclamp restriction

    The iowait_boost signal is applied independently of util and doesn't take
    into account the uclamp settings of the rq. An I/O-heavy task that is capped
    by uclamp_max could still request a higher frequency because
    sugov_iowait_apply() doesn't clamp the boost via uclamp_rq_util_with()
    like effective_cpu_util() does.
    
    Make sure that iowait_boost honours uclamp requests by calling
    uclamp_rq_util_with() when applying the boost.
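    
    The effect of the fix can be modelled in userspace as below. This is a
    sketch, not the kernel code: struct rq_clamp, clamp_util() and
    apply_iowait_boost() are hypothetical stand-ins for the rq's uclamp
    state, uclamp_rq_util_with() and sugov_iowait_apply(), and the 1024
    capacity scale is an assumption.

```c
#include <assert.h>

#define SCHED_CAPACITY_SHIFT	10

/* Hypothetical stand-in for the uclamp min/max values of a runqueue. */
struct rq_clamp {
	unsigned long min;
	unsigned long max;
};

/* Clamp a utilization value into [min, max]. */
static unsigned long clamp_util(unsigned long util, const struct rq_clamp *c)
{
	assert(c->min <= c->max);
	if (util < c->min)
		return c->min;
	if (util > c->max)
		return c->max;
	return util;
}

/* Scale the boost to capacity, clamp it, then raise util to it if needed. */
static unsigned long apply_iowait_boost(unsigned long util,
					unsigned long iowait_boost,
					unsigned long max_cap,
					const struct rq_clamp *c)
{
	unsigned long boost = (iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;

	boost = clamp_util(boost, c);	/* the step the fix adds */
	return util < boost ? boost : util;
}
```

    With a cap of 400, a full boost of 1024 now yields 400 instead of 1024,
    so the capped task can no longer drive the frequency request above its
    limit through iowait boosting alone.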
    
    Fixes: 982d9cd ("sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks")
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Link: https://lore.kernel.org/r/20211216225320.2957053-3-qais.yousef@arm.com
    qais-yousef authored and Peter Zijlstra committed Jan 27, 2022
  4. sched/sugov: Ignore 'busy' filter when rq is capped by uclamp_max

    sugov_update_single_{freq, perf}() contains a 'busy' filter that ensures
    we don't bring the frequency down if there's no idle time (CPU is busy).
    
    The problem is that with uclamp_max we will have scenarios where a busy
    task is capped to run at a lower frequency and this filter prevents
    applying the capping when this task starts running.
    
    We handle this by skipping the filter when uclamp is enabled and the rq
    is being capped by uclamp_max.
    
    We introduce a new function uclamp_rq_is_capped() to help detect when
    this capping is taking effect. Some code shuffling was required to allow
    using cpu_util_{cfs, rt}() in this new function.
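    
    A userspace model of the check and of the filter it bypasses might look
    like the sketch below. Only the names uclamp_rq_is_capped and
    cpu_util_{cfs, rt} come from the text above; struct rq_model,
    pick_next_freq() and the exact comparisons are assumptions made for
    illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SCHED_CAPACITY_SCALE	1024UL

/* Hypothetical snapshot of the runqueue values the check needs. */
struct rq_model {
	unsigned long util_cfs;		/* stands in for cpu_util_cfs() */
	unsigned long util_rt;		/* stands in for cpu_util_rt() */
	unsigned long uclamp_max;	/* aggregated uclamp_max of the rq */
};

/* The rq is capped when a cap is set and utilization has reached it. */
static bool uclamp_rq_is_capped(const struct rq_model *rq)
{
	unsigned long rq_util;

	assert(rq != NULL);
	rq_util = rq->util_cfs + rq->util_rt;
	return rq->uclamp_max != SCHED_CAPACITY_SCALE &&
	       rq_util >= rq->uclamp_max;
}

/* 'busy' filter: keep the previous frequency unless the rq is capped. */
static unsigned int pick_next_freq(const struct rq_model *rq, bool cpu_busy,
				   unsigned int next_f, unsigned int prev_f)
{
	if (!uclamp_rq_is_capped(rq) && cpu_busy && next_f < prev_f)
		return prev_f;
	return next_f;
}
```

    For a busy CPU whose rq is capped, the lower frequency is now applied;
    an uncapped busy CPU still holds its previous frequency as before.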
    
    On a 2-core SMT2 Intel laptop I see:
    
    Without this patch:
    
    	uclampset -M 0 sysbench --test=cpu --threads=4 run
    
    produces a score of ~3200 consistently, which is the highest possible.
    
    Compiling the kernel also results in frequency running at max 3.1GHz all
    the time - running uclampset -M 400 to cap it has no effect without this
    patch.
    
    With this patch:
    
    	uclampset -M 0 sysbench --test=cpu --threads=4 run
    
    produces a score of ~1100 with some outliers at ~1700. Uclamp max
    aggregates the performance requirements, so occasional high values are
    expected if some other task that requires that frequency happens to
    start running at the same time.
    
    When compiling the kernel with uclampset -M 400 I can see the
    frequencies mostly in the ~2GHz region. This helps conserve power and
    prevent heating when not plugged in.
    
    Fixes: 982d9cd ("sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks")
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20211216225320.2957053-2-qais.yousef@arm.com
    qais-yousef authored and Peter Zijlstra committed Jan 27, 2022
  5. sched/core: Export pelt_thermal_tp

    We can't use this tracepoint in modules without having the symbol
    exported first. Fix that.
    
    Fixes: 7650479 ("sched/pelt: Add support to track thermal pressure")
    Signed-off-by: Qais Yousef <qais.yousef@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20211028115005.873539-1-qais.yousef@arm.com
    qais-yousef authored and Peter Zijlstra committed Jan 27, 2022
  6. MAINTAINERS: add Suren as psi co-maintainer

    Suren wrote the poll() interface, which is a significant part of the
    psi code and represents a large user of psi itself (Android). It's a
    good idea to have him look at psi patches as well, and it's good to
    have two people following things in case one of us is traveling.
    
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220117120317.1581315-1-hannes@cmpxchg.org
    hnaz authored and Peter Zijlstra committed Jan 27, 2022
  7. sched/numa: initialize numa statistics when forking new task

    Child processes inherit numa_pages_migrated and total_numa_faults
    from the parent. This means that even if no NUMA fault has happened in
    the child, the statistics in /proc/$pid of the child process might
    show a huge amount. This is a bit weird. Initialize them at fork time.
    
    Signed-off-by: Honglei Wang <wanghonglei@didichuxing.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Link: https://lore.kernel.org/r/20220113133920.49900-1-wanghonglei@didichuxing.com
    Honglei Wang authored and Peter Zijlstra committed Jan 27, 2022