Commits on Feb 15, 2022

  1. mm/thp: shrink_page_list() avoid splitting VM_LOCKED THP

    4.8 commit 7751b2d ("vmscan: split file huge pages before paging
    them out") inserted a split_huge_page_to_list() into shrink_page_list()
    without considering the mlock case: no problem if the page has already
    been marked as Mlocked (the !page_evictable check much higher up will
    have skipped all this), but it has always been the case that races or
    omissions in setting Mlocked can rely on page reclaim to detect this
    and correct it before actually reclaiming - and that remains so, but
    what a shame if a hugepage is needlessly split before discovering it.
    
    It is surprising that page_check_references() returns PAGEREF_RECLAIM
    when VM_LOCKED, but there was a good reason for that: try_to_unmap_one()
    is where the condition is detected and corrected; and until now it could
    not be done in page_referenced_one(), because that does not always have
    the page locked.  Now that mlock's requirement for page lock has gone,
    copy try_to_unmap_one()'s mlock restoration into page_referenced_one(),
    and let page_check_references() return PAGEREF_ACTIVATE in this case.
    
    But page_referenced_one() may find a pte mapping one part of a hugepage:
    what hold should a pte mapped in a VM_LOCKED area exert over the entire
    huge page?  That's debatable.  The approach taken here is to treat that
    pte mapping in page_referenced_one() as if not VM_LOCKED, and if no
    VM_LOCKED pmd mapping is found later in the walk, and lack of reference
    permits, then PAGEREF_RECLAIM takes it to attempted splitting as before.
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Hugh Dickins authored and intel-lab-lkp committed Feb 15, 2022
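The decision change described above can be sketched as a userspace model: before the patch a VM_LOCKED page fell through to PAGEREF_RECLAIM (and a THP could be split needlessly); after it, the rmap walk's VM_LOCKED finding yields PAGEREF_ACTIVATE. Names and the simplified decision table below are illustrative, not the kernel's real implementation.

```c
#include <assert.h>
#include <stdbool.h>

enum pageref { PAGEREF_RECLAIM, PAGEREF_KEEP, PAGEREF_ACTIVATE };

/* Model of page_check_references() after the patch: a VM_LOCKED
 * mapping reported by the rmap walk activates the page (restoring it
 * to the unevictable list) instead of letting it reach the splitting
 * and reclaim path. */
enum pageref page_check_references(bool vm_locked, int referenced_ptes)
{
    if (vm_locked)
        return PAGEREF_ACTIVATE;   /* mlocked: skip the needless split */
    if (referenced_ptes)
        return PAGEREF_ACTIVATE;   /* recently referenced: keep active */
    return PAGEREF_RECLAIM;        /* may be split and reclaimed */
}
```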
  2. mm/thp: collapse_file() do try_to_unmap(TTU_BATCH_FLUSH)

    collapse_file() is using unmap_mapping_pages(1) on each small page found
    mapped, unlike others (reclaim, migration, splitting, memory-failure) which
    use try_to_unmap().  There are four advantages to try_to_unmap(): first,
    its TTU_IGNORE_MLOCK option now avoids leaving mlocked pages in the pagevec;
    second, its vma lookup uses i_mmap_lock_read() not i_mmap_lock_write();
    third, it breaks out early if page is not mapped everywhere it might be;
    fourth, its TTU_BATCH_FLUSH option can be used, as in page reclaim, to
    save up all the TLB flushing until all of the pages have been unmapped.
    
    Wild guess: perhaps it was originally written to use try_to_unmap(),
    but hit the VM_BUG_ON_PAGE(page_mapped) after unmapping, because without
    TTU_SYNC it may skip page table locks; whereas unmap_mapping_pages() never
    skips them, so that fixed the issue.  I did once hit that VM_BUG_ON_PAGE()
    since making this change: we could pass TTU_SYNC here, but I think just
    delete the check - the race is very rare, this is an ordinary small page
    so we don't need to be so paranoid about mapcount surprises, and the
    page_ref_freeze() just below already handles the case adequately.
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Hugh Dickins authored and intel-lab-lkp committed Feb 15, 2022
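The fourth advantage, TTU_BATCH_FLUSH, is the interesting one here: rather than one TLB flush per unmapped page, a flush is recorded as pending and issued once for the whole batch. A toy userspace model (all names illustrative, not kernel API):

```c
#include <assert.h>
#include <stdbool.h>

static int tlb_flushes;      /* how many flushes were actually issued */
static bool flush_pending;   /* batched-mode: a flush is owed */

static void unmap_page(bool batch_flush)
{
    if (batch_flush)
        flush_pending = true;   /* defer, like TTU_BATCH_FLUSH */
    else
        tlb_flushes++;          /* immediate flush per page */
}

static void try_to_unmap_flush(void)
{
    if (flush_pending) {
        tlb_flushes++;          /* one flush covers the whole batch */
        flush_pending = false;
    }
}

/* Unmap n pages; return how many TLB flushes were issued. */
int unmap_batch(int n, bool batch_flush)
{
    tlb_flushes = 0;
    flush_pending = false;
    for (int i = 0; i < n; i++)
        unmap_page(batch_flush);
    try_to_unmap_flush();
    return tlb_flushes;
}
```

With batching, 512 unmaps cost one flush instead of 512.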
  3. mm/munlock: page migration needs mlock pagevec drained

    Page migration of a VM_LOCKED page tends to fail, because when the old
    page is unmapped, it is put on the mlock pagevec with raised refcount,
    which then fails the freeze.
    
    At first I thought this would be fixed by a local mlock_page_drain() at
    the upper rmap_walk() level - which would have nicely batched all the
    munlocks of that page; but tests show that the task can too easily move
    to another cpu, leaving pagevec residue behind which fails the migration.
    
    So let try_to_migrate_one() drain the local pagevec after page_remove_rmap()
    from a VM_LOCKED vma; and do the same in try_to_unmap_one(), whose
    TTU_IGNORE_MLOCK users would want the same treatment; and do the same
    in remove_migration_pte() - not important when successfully inserting
    a new page, but necessary when hoping to retry after failure.
    
    Any new pagevec runs the risk of adding a new way of stranding, and we
    might discover other corners where mlock_page_drain() or lru_add_drain()
    would now help.
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Hugh Dickins authored and intel-lab-lkp committed Feb 15, 2022
  4. mm/munlock: mlock_page() munlock_page() batch by pagevec

    A weakness of the page->mlock_count approach is the need for lruvec lock
    while holding page table lock.  That is not an overhead we would allow on
    normal pages, but I think acceptable just for pages in an mlocked area.
    But let's try to amortize the extra cost by gathering on per-cpu pagevec
    before acquiring the lruvec lock.
    
    I have an unverified conjecture that the mlock pagevec might work out
    well for delaying the mlock processing of new file pages until they have
    got off lru_cache_add()'s pagevec and on to LRU.
    
    The initialization of page->mlock_count is subject to races and awkward:
    0 or !!PageMlocked or 1?  Was it wrong even in the implementation before
    this commit, which just widens the window?  I haven't gone back to think
    it through.  Maybe someone can point out a better way to initialize it.
    
    Bringing lru_cache_add_inactive_or_unevictable()'s mlock initialization
    into mm/mlock.c has helped: mlock_new_page(), using the mlock pagevec,
    rather than lru_cache_add()'s pagevec.
    
    Experimented with various orderings: the right thing seems to be for
    mlock_page() and mlock_new_page() to TestSetPageMlocked before adding to
    pagevec, but munlock_page() to leave TestClearPageMlocked to the later
    pagevec processing.
    
    Dropped the VM_BUG_ON_PAGE(PageTail)s this time around: they have made
    their point, and the thp_nr_pages() calls already contain a VM_BUG_ON_PGFLAGS()
    for that.
    
    This still leaves acquiring lruvec locks under page table lock each time
    the pagevec fills (or a THP is added): which I suppose is rather silly,
    since they sit on pagevec waiting to be processed long after page table
    lock has been dropped; but I'm disinclined to uglify the calling sequence
    until some load shows an actual problem with it (nothing wrong with
    taking lruvec lock under page table lock, just "nicer" to do it less).
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Hugh Dickins authored and intel-lab-lkp committed Feb 15, 2022
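The amortization described above - TestSetPageMlocked immediately, but the lruvec lock only when the per-cpu pagevec fills - can be modelled in userspace. The kernel's pagevec holds 15 pages; everything else below is a simplified sketch with illustrative names.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGEVEC_SIZE 15   /* kernel pagevec capacity */

struct page { bool mlocked; int mlock_count; };

static struct page *pvec[PAGEVEC_SIZE];
static int pvec_nr;
static int lruvec_lock_taken;   /* how often we paid for the lruvec lock */

static void mlock_pagevec_flush(void)
{
    lruvec_lock_taken++;                 /* one lock for the whole batch */
    for (int i = 0; i < pvec_nr; i++)
        pvec[i]->mlock_count++;          /* deferred mlock processing */
    pvec_nr = 0;
}

void mlock_page(struct page *page)
{
    page->mlocked = true;                /* TestSetPageMlocked up front */
    pvec[pvec_nr++] = page;
    if (pvec_nr == PAGEVEC_SIZE)
        mlock_pagevec_flush();           /* amortized lruvec lock */
}
```

Mlocking 30 pages takes the (modelled) lruvec lock only twice instead of 30 times.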
  5. mm/munlock: delete smp_mb() from __pagevec_lru_add_fn()

    My reading of comment on smp_mb__after_atomic() in __pagevec_lru_add_fn()
    says that it can now be deleted; and that remains so when the next patch
    is added.
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Hugh Dickins authored and intel-lab-lkp committed Feb 15, 2022
  6. mm/migrate: __unmap_and_move() push good newpage to LRU

    Compaction, NUMA page movement, THP collapse/split, and memory failure
    do isolate unevictable pages from their "LRU", losing the record of
    mlock_count in doing so (isolators are likely to use page->lru for their
    own private lists, so mlock_count has to be presumed lost).
    
    That's unfortunate, and we should put in some work to correct that: one
    can imagine a function to build up the mlock_count again - but it would
    require i_mmap_rwsem for read, so be careful where it's called.  Or
    page_referenced_one() and try_to_unmap_one() might do that extra work.
    
    But one place that can very easily be improved is page migration's
    __unmap_and_move(): a small adjustment to where the successful new page
    is put back on LRU, and its mlock_count (if any) is built back up by
    remove_migration_ptes().
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Hugh Dickins authored and intel-lab-lkp committed Feb 15, 2022
  7. mm/munlock: mlock_pte_range() when mlocking or munlocking

    Fill in missing pieces: reimplementation of munlock_vma_pages_range(),
    required to lower the mlock_counts when munlocking without munmapping;
    and its complement, implementation of mlock_vma_pages_range(), required
    to raise the mlock_counts on pages already there when a range is mlocked.
    
    Combine them into just the one function mlock_vma_pages_range(), using
    walk_page_range() to run mlock_pte_range().  This approach fixes the
    "Very slow unlockall()" of unpopulated PROT_NONE areas, reported in
    https://lore.kernel.org/linux-mm/70885d37-62b7-748b-29df-9e94f3291736@gmail.com/
    
    Munlock clears VM_LOCKED at the start, under exclusive mmap_lock; but if
    a racing truncate or holepunch (depending on i_mmap_rwsem) gets to the
    pte first, it will not try to munlock the page: leaving release_pages()
    to correct it when the last reference to the page is gone - that's okay,
    a page is not evictable anyway while it is held by an extra reference.
    
    Mlock sets VM_LOCKED at the start, under exclusive mmap_lock; but if
    a racing remove_migration_pte() or try_to_unmap_one() (depending on
    i_mmap_rwsem) gets to the pte first, it will try to mlock the page,
    then mlock_pte_range() will mlock it a second time.  This is harder to
    reproduce, but a more serious race because it could leave the page
    unevictable indefinitely even though the area is munlocked afterwards.
    Guard against it by setting the (inappropriate) VM_IO flag,
    and modifying mlock_vma_page() to decline such vmas.
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Hugh Dickins authored and intel-lab-lkp committed Feb 15, 2022
  8. mm/munlock: maintain page->mlock_count while unevictable

    Previous patches have been preparatory: now implement page->mlock_count.
    The ordering of the "Unevictable LRU" is of no significance, and there is
    no point holding unevictable pages on a list: place page->mlock_count to
    overlay page->lru.prev (since page->lru.next is overlaid by compound_head,
    which needs to be even so as not to satisfy PageTail - though 2 could be
    added instead of 1 for each mlock, if that's ever an improvement).
    
    But it's only safe to rely on or modify page->mlock_count while lruvec
    lock is held and page is on unevictable "LRU" - we can save lots of edits
    by continuing to pretend that there's an imaginary LRU here (there is an
    unevictable count which still needs to be maintained, but not a list).
    
    The mlock_count technique suffers from an unreliability much like with
    page_mlock(): while someone else has the page off LRU, not much can
    be done.  As before, err on the safe side (behave as if mlock_count 0),
    and let try_to_unmap_one() move the page to unevictable if reclaim finds
    out later on - a few misplaced pages don't matter, what we want to avoid
    is imbalancing reclaim by flooding evictable lists with unevictable pages.
    
    I am not a fan of "if (!isolate_lru_page(page)) putback_lru_page(page);":
    if we have taken lruvec lock to get the page off its present list, then
    we save everyone trouble (and however many extra atomic ops) by putting
    it on its destination list immediately.
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Hugh Dickins authored and intel-lab-lkp committed Feb 15, 2022
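The overlay described above - mlock_count reusing the word that normally holds page->lru.prev, while lru.next is already overlaid by compound_head, which must stay even so the page never looks like a PageTail - can be sketched with a union. Field names mimic, but greatly simplify, the kernel's struct page.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

struct page {
    union {
        struct list_head lru;            /* while on a real LRU list */
        struct {
            /* overlays lru.next: kept even (low bit clear) so the
             * page does not satisfy PageTail */
            unsigned long compound_head;
            /* overlays lru.prev: valid only while the page sits on
             * the imaginary unevictable "LRU", under lruvec lock */
            unsigned int mlock_count;
        };
    };
};
```

The point of the union is that no extra space is spent: an mlocked page on the unevictable "LRU" simply has no list links to preserve.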
  9. mm/munlock: replace clear_page_mlock() by final clearance

    Placing munlock_vma_page() at the end of page_remove_rmap() shifts most
    of the munlocking to clear_page_mlock(), since PageMlocked is typically
    still set when mapcount has fallen to 0.  That is not what we want: we
    want /proc/vmstat's unevictable_pgs_cleared to remain as a useful check
    on the integrity of the mlock/munlock protocol - small numbers are
    not surprising, but big numbers mean the protocol is not working.
    
    That could be easily fixed by placing munlock_vma_page() at the start of
    page_remove_rmap(); but later in the series we shall want to batch the
    munlocking, and that too would tend to leave PageMlocked still set at
    the point when it is checked.
    
    So delete clear_page_mlock() now: leave it instead to release_pages()
    (and __page_cache_release()) to do this backstop clearing of Mlocked,
    when page refcount has fallen to 0.  If a pinned page occasionally gets
    counted as Mlocked and Unevictable until it is unpinned, that's okay.
    
    A slightly regrettable side-effect of this change is that, since
    release_pages() and __page_cache_release() may be called at interrupt
    time, those places which update NR_MLOCK with interrupts enabled
    had better use mod_zone_page_state() than __mod_zone_page_state()
    (but holding the lruvec lock always has interrupts disabled).
    
    This change, forcing Mlocked off when refcount 0 instead of earlier
    when mapcount 0, is not fundamental: it can be reversed if performance
    or something else is found to suffer; but this is the easiest way to
    separate the stats - let's not complicate that without good reason.
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Hugh Dickins authored and intel-lab-lkp committed Feb 15, 2022
  10. mm/munlock: rmap call mlock_vma_page() munlock_vma_page()

    Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
    inline functions which check (vma->vm_flags & VM_LOCKED) before calling
    mlock_page() and munlock_page() in mm/mlock.c.
    
    Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
    because we have understandable difficulty in accounting pte maps of THPs,
    and if passed a PageHead page, mlock_page() and munlock_page() cannot
    tell whether it's a pmd map to be counted or a pte map to be ignored.
    
    Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the
    others, and use that to call mlock_vma_page() at the end of the page
    adds, and munlock_vma_page() at the end of page_remove_rmap() (end or
    beginning? unimportant, but end was easier for assertions in testing).
    
    No page lock is required (although almost all adds happen to hold it):
    delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s.
    Certainly page lock did serialize with page migration, but I'm having
    difficulty explaining why that was ever important.
    
    Mlock accounting on THPs has been hard to define, differed between anon
    and file, involved PageDoubleMap in some places and not others, required
    clear_page_mlock() at some points.  Keep it simple now: just count the
    pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.
    
    page_add_new_anon_rmap() callers unchanged: they have long been calling
    lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED
    handling (it also checks for not VM_SPECIAL: I think that's overcautious,
    and inconsistent with other checks, since mmap_region() already prevents
    VM_LOCKED on VM_SPECIAL; but I haven't quite convinced myself to change it).
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Hugh Dickins authored and intel-lab-lkp committed Feb 15, 2022
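The shape of the new hooks is: a cheap inline test of vma->vm_flags at the rmap call sites, with the real work in mm/mlock.c only for VM_LOCKED vmas. A userspace sketch (the VM_LOCKED value matches the kernel's headers; the real helpers also take a bool compound so that pte maps of THPs are ignored, which this sketch omits):

```c
#include <assert.h>

#define VM_LOCKED 0x00002000UL   /* flag value as in the kernel's mm.h */

struct vm_area_struct { unsigned long vm_flags; };
struct page { int mlock_count; };

static int mlock_calls;          /* counts calls into the slow path */

/* stand-in for the out-of-line worker in mm/mlock.c */
static void mlock_page(struct page *page)
{
    mlock_calls++;
    page->mlock_count++;
}

/* inline wrapper: most vmas are not VM_LOCKED, so most callers pay
 * only one flag test and no function call */
static inline void mlock_vma_page(struct page *page,
                                  struct vm_area_struct *vma)
{
    if (vma->vm_flags & VM_LOCKED)
        mlock_page(page);
}
```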
  11. mm/munlock: delete munlock_vma_pages_all(), allow oomreap

    munlock_vma_pages_range() will still be required, when munlocking but
    not munmapping a set of pages; but when unmapping a pte, the mlock count
    will be maintained in much the same way as it will be maintained when
    mapping in the pte.  Which removes the need for munlock_vma_pages_all()
    on mlocked vmas when munmapping or exiting: eliminating the catastrophic
    contention on i_mmap_rwsem, and the need for page lock on the pages.
    
    There is still a need to update locked_vm accounting according to the
    munmapped vmas when munmapping: do that in detach_vmas_to_be_unmapped().
    exit_mmap() does not need locked_vm updates, so delete unlock_range().
    
    And wasn't I the one who forbade the OOM reaper to attack mlocked vmas,
    because of the uncertainty in blocking on all those page locks?
    No fear of that now, so permit the OOM reaper on mlocked vmas.
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Hugh Dickins authored and intel-lab-lkp committed Feb 15, 2022
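The locked_vm bookkeeping that moves into detach_vmas_to_be_unmapped() amounts to: for each detached VM_LOCKED vma, subtract its page count from mm->locked_vm. A simplified model, with structs cut down to the fields involved:

```c
#include <assert.h>

#define PAGE_SHIFT 12             /* 4KiB pages */
#define VM_LOCKED  0x00002000UL   /* flag value as in the kernel's mm.h */

struct vm_area_struct { unsigned long vm_start, vm_end, vm_flags; };
struct mm_struct { unsigned long locked_vm; };

/* Sketch of the accounting only: drop each detached VM_LOCKED vma's
 * pages from mm->locked_vm (the real function also unlinks the vmas). */
void detach_vmas_to_be_unmapped(struct mm_struct *mm,
                                struct vm_area_struct *vmas, int n)
{
    for (int i = 0; i < n; i++)
        if (vmas[i].vm_flags & VM_LOCKED)
            mm->locked_vm -=
                (vmas[i].vm_end - vmas[i].vm_start) >> PAGE_SHIFT;
}
```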
  12. mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE

    If counting page mlocks, we must not double-count: follow_page_pte() can
    tell if a page has already been Mlocked or not, but cannot tell if a pte
    has already been counted or not: that will have to be done when the pte
    is mapped in (which lru_cache_add_inactive_or_unevictable() already tracks
    for new anon pages, but there's no such tracking yet for others).
    
    Delete all the FOLL_MLOCK code - faulting in the missing pages will do
    all that is necessary, without special mlock_vma_page() calls from here.
    
    But then FOLL_POPULATE turns out to serve no purpose - it was there so
    that its absence would tell faultin_page() not to faultin page when
    setting up VM_LOCKONFAULT areas; but if there's no special work needed
    here for mlock, then there's no work at all here for VM_LOCKONFAULT.
    
    Have I got that right?  I've not looked into the history, but see that
    FOLL_POPULATE goes back before VM_LOCKONFAULT: did it serve a different
    purpose before?  Ah, yes, it was used to skip the old stack guard page.
    
    And is it intentional that COW is not broken on existing pages when
    setting up a VM_LOCKONFAULT area?  I can see that being argued either
    way, and have no reason to disagree with current behaviour.
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Hugh Dickins authored and intel-lab-lkp committed Feb 15, 2022
  13. mm/munlock: delete page_mlock() and all its works

    We have recommended some applications to mlock their userspace, but that
    turns out to be counter-productive: when many processes mlock the same
    file, contention on rmap's i_mmap_rwsem can become intolerable at exit: it
    is needed for write, to remove any vma mapping that file from rmap's tree;
    but hogged for read by those with mlocks calling page_mlock() (formerly
    known as try_to_munlock()) on *each* page mapped from the file (the
    purpose being to find out whether another process has the page mlocked,
    so therefore it should not be unmlocked yet).
    
    Several optimizations have been made in the past: one is to skip
    page_mlock() when mapcount tells that nothing else has this page
    mapped; but that doesn't help at all when others do have it mapped.
    This time around, I initially intended to add a preliminary search
    of the rmap tree for overlapping VM_LOCKED ranges; but that gets
    messy with locking order, when in doubt whether a page is actually
    present; and risks adding even more contention on the i_mmap_rwsem.
    
    A solution would be much easier, if only there were space in struct page
    for an mlock_count... but actually, most of the time, there is space for
    it - an mlocked page spends most of its life on an unevictable LRU, but
    since 3.18 removed the scan_unevictable_pages sysctl, that "LRU" has
    been redundant.  Let's try to reuse its page->lru.
    
    But leave that until a later patch: in this patch, clear the ground by
    removing page_mlock(), and all the infrastructure that has gathered
    around it - which mostly hinders understanding, and will make reviewing
    new additions harder.  Don't mind those old comments about THPs, they
    date from before 4.5's refcounting rework: splitting is not a risk here.
    
    Just keep a minimal version of munlock_vma_page(), as reminder of what it
    should attend to (in particular, the odd way PGSTRANDED is counted out of
    PGMUNLOCKED), and likewise a stub for munlock_vma_pages_range().  Move
    unchanged __mlock_posix_error_return() out of the way, down to above its
    caller: this series then makes no further change after mlock_fixup().
    
    After this and each following commit, the kernel builds, boots and runs;
    but with deficiencies which may show up in testing of mlock and munlock.
    The system calls succeed or fail as before, and mlock remains effective
    in preventing page reclaim; but meminfo's Unevictable and Mlocked amounts
    may be shown too low after mlock, grow, then stay too high after munlock:
    with previously mlocked pages remaining unevictable for too long, until
    finally unmapped and freed and counts corrected. Normal service will be
    resumed in "mm/munlock: mlock_pte_range() when mlocking or munlocking".
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Hugh Dickins authored and intel-lab-lkp committed Feb 15, 2022

Commits on Feb 2, 2022

  1. perf/x86/intel: Increase max number of the fixed counters

    The new PEBS format 5 implies that the number of the fixed counters can
    be up to 16. The current INTEL_PMC_MAX_FIXED is still 4. If the current
    kernel runs on a future platform which has more than 4 fixed counters,
    a warning will be triggered. The number of the fixed counters will be
    clipped to 4. Users have to upgrade the kernel to access the new fixed
    counters.
    
    Add a new default constraint for PerfMon v5 and up, which can support
    up to 16 fixed counters. The pseudo-encoding is applied for the fixed
    counters 4 and later. The user can have generic support for the new
    fixed counters on future platforms without updating the kernel.
    
    Increase the INTEL_PMC_MAX_FIXED to 16.
    
    Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Andi Kleen <ak@linux.intel.com>
    Link: https://lkml.kernel.org/r/1643750603-100733-3-git-send-email-kan.liang@linux.intel.com
    Kan Liang authored and Peter Zijlstra committed Feb 2, 2022
  2. KVM: x86: use the KVM side max supported fixed counter

    KVM vPMU doesn't support emulating all the fixed counters that the
    host PMU driver has supported, e.g. the fixed counter 3 used by
    Topdown metrics hasn't been supported by KVM so far.
    
    Rename MAX_FIXED_COUNTERS to KVM_PMC_MAX_FIXED to have a more
    straightforward naming convention as INTEL_PMC_MAX_FIXED used by the
    host PMU driver, and fix vPMU to use the KVM side KVM_PMC_MAX_FIXED
    for the virtual fixed counter emulation, instead of the host side
    INTEL_PMC_MAX_FIXED.
    
    Signed-off-by: Wei Wang <wei.w.wang@intel.com>
    Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/1643750603-100733-2-git-send-email-kan.liang@linux.intel.com
    wei-w-wang authored and Peter Zijlstra committed Feb 2, 2022
  3. perf/x86/intel: Enable PEBS format 5

    The new PEBS Record Format 5 is similar to the PEBS Record Format 4. The
    only difference is the layout of the Counter Reset fields of the PEBS
    Config Buffer in the DS area. For the PEBS format 4, the Counter Reset
    fields allocation is for 8 general-purpose counters followed by 4
    fixed-function counters. For the PEBS format 5, the Counter Reset fields
    allocation is for 32 general-purpose counters followed by 16
    fixed-function counters.
    
    Extend the MAX_PEBS_EVENTS to 32. Add MAX_PEBS_EVENTS_FMT4 for the
    previous platform. Except for the DS auto-reload code, other places
    already assume 32 counters. Only check the PEBS_FMT in the DS
    auto-reload code.
    
    Extend the MAX_FIXED_PEBS_EVENTS to 16, which only impacts the size of
    struct debug_store and some local temporary variables. The size of
    struct debug_store increases by 288 bytes, which is small and should
    be acceptable.
    
    Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/1643750603-100733-1-git-send-email-kan.liang@linux.intel.com
    Kan Liang authored and Peter Zijlstra committed Feb 2, 2022
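The 288-byte figure can be sanity-checked from the commit text: the counter-reset area grows from 8 general-purpose plus 4 fixed u64 fields (format 4) to 32 plus 16 (format 5). The constants below mirror the commit; MAX_FIXED_PEBS_FMT4 is an illustrative name, not a kernel macro.

```c
#include <assert.h>

enum {
    MAX_PEBS_EVENTS_FMT4  = 8,    /* GP counter-reset fields, format 4 */
    MAX_FIXED_PEBS_FMT4   = 4,    /* fixed counter-reset fields, format 4 */
    MAX_PEBS_EVENTS       = 32,   /* GP counter-reset fields, format 5 */
    MAX_FIXED_PEBS_EVENTS = 16,   /* fixed counter-reset fields, format 5 */
};

/* How much struct debug_store grows when the counter-reset arrays are
 * extended for PEBS format 5; each field is a 64-bit value. */
int debug_store_growth_bytes(void)
{
    int old_sz = (MAX_PEBS_EVENTS_FMT4 + MAX_FIXED_PEBS_FMT4)
                 * (int)sizeof(unsigned long long);
    int new_sz = (MAX_PEBS_EVENTS + MAX_FIXED_PEBS_EVENTS)
                 * (int)sizeof(unsigned long long);
    return new_sz - old_sz;
}
```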
  4. perf/core: Allow kernel address filter when not filtering the kernel

    The so-called 'kernel' address filter can also be useful for filtering
    fixed addresses in user space.  Allow that.
    
    Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20220131072453.2839535-6-adrian.hunter@intel.com
    ahunter6 authored and Peter Zijlstra committed Feb 2, 2022
  5. perf/x86/intel/pt: Fix address filter config for 32-bit kernel

    Change from shifting 'unsigned long' to 'u64' to prevent the config bits
    being lost on a 32-bit kernel.
    
    Fixes: eadf48c ("perf/x86/intel/pt: Add support for address range filtering in PT")
    Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20220131072453.2839535-5-adrian.hunter@intel.com
    ahunter6 authored and Peter Zijlstra committed Feb 2, 2022
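The bug class this fixes is easy to demonstrate in userspace: shifting a value that is only 'unsigned long' wide performs the shift in 32 bits on a 32-bit kernel, dropping the bits that should land in the upper half of the 64-bit config. Here uint32_t stands in for a 32-bit unsigned long, and the function names are illustrative.

```c
#include <stdint.h>

/* Buggy shape: the shift happens at 32-bit width, so bits shifted past
 * bit 31 are lost before the widening to u64. */
uint64_t filter_config_buggy(uint32_t msr_b, int shift)
{
    return msr_b << shift;
}

/* Fixed shape, as in the patch: widen to u64 first, then shift. */
uint64_t filter_config_fixed(uint32_t msr_b, int shift)
{
    return (uint64_t)msr_b << shift;
}
```

With msr_b = 0x80000001 and shift = 4, the buggy version loses the top bit (result 0x10) while the fixed version keeps it (result 0x800000010).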
  6. perf/core: Fix address filter parser for multiple filters

    Reset appropriate variables in the parser loop between parsing separate
    filters, so that they do not interfere with parsing the next filter.
    
    Fixes: 375637b ("perf/core: Introduce address range filtering")
    Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20220131072453.2839535-4-adrian.hunter@intel.com
    ahunter6 authored and Peter Zijlstra committed Feb 2, 2022
  7. x86: Share definition of __is_canonical_address()

    Reduce code duplication by moving canonical address code to a common header
    file.
    
    Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20220131072453.2839535-3-adrian.hunter@intel.com
    ahunter6 authored and Peter Zijlstra committed Feb 2, 2022
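The check being shared is the standard x86-64 canonical-address test: an address is canonical when its top bits are a sign-extension of the highest implemented virtual-address bit (bit 47 for 48-bit virtual addresses). A userspace version of the sign-extend trick (the function name is illustrative; the kernel's helper is parameterized the same way on the vaddr width):

```c
#include <stdbool.h>
#include <stdint.h>

/* Shift the address so its top implemented bit becomes the sign bit,
 * arithmetic-shift back, and compare: equality means the upper bits
 * were already a correct sign-extension. */
static inline bool is_canonical_address(uint64_t vaddr,
                                        unsigned int vaddr_bits)
{
    unsigned int shift = 64 - vaddr_bits;
    uint64_t sext = (uint64_t)((int64_t)(vaddr << shift) >> shift);
    return sext == vaddr;
}
```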
  8. perf/x86/intel/pt: Relax address filter validation

    The requirement for 64-bit address filters is that they are canonical
    addresses. In other respects any address range is allowed which would
    include user space addresses.
    
    That can be useful for tracing virtual machine guests because address
    filtering can be used to advantage in place of current privilege level
    (CPL) filtering.
    
    Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20220131072453.2839535-2-adrian.hunter@intel.com
    ahunter6 authored and Peter Zijlstra committed Feb 2, 2022

Commits on Jan 30, 2022

  1. Linux 5.17-rc2

    torvalds committed Jan 30, 2022
  2. Merge tag 'irq_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
    
    Pull irq fixes from Borislav Petkov:
    
     - Drop an unused private data field in the AIC driver
    
     - Various fixes to the realtek-rtl driver
    
     - Make the GICv3 ITS driver compile again in !SMP configurations
    
     - Force reset of the GICv3 ITSs at probe time to avoid issues during kexec
    
     - Yet another kfree/bitmap_free conversion
    
     - Various DT updates (Renesas, SiFive)
    
    * tag 'irq_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
      dt-bindings: interrupt-controller: sifive,plic: Group interrupt tuples
      dt-bindings: interrupt-controller: sifive,plic: Fix number of interrupts
      dt-bindings: irqchip: renesas-irqc: Add R-Car V3U support
      irqchip/gic-v3-its: Reset each ITS's BASERn register before probe
      irqchip/gic-v3-its: Fix build for !SMP
      irqchip/loongson-pch-ms: Use bitmap_free() to free bitmap
      irqchip/realtek-rtl: Service all pending interrupts
      irqchip/realtek-rtl: Fix off-by-one in routing
      irqchip/realtek-rtl: Map control data to virq
      irqchip/apple-aic: Drop unused ipi_hwirq field
    torvalds committed Jan 30, 2022
  3. Merge tag 'perf_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
    
    Pull perf fixes from Borislav Petkov:
    
     - Prevent accesses to the per-CPU cgroup context list from another CPU
       except the one it belongs to, to avoid list corruption
    
     - Make sure parent events are always woken up to avoid indefinite hangs
       in the traced workload
    
    * tag 'perf_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
      perf/core: Fix cgroup event list management
      perf: Always wake the parent event
    torvalds committed Jan 30, 2022
  4. Merge tag 'sched_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
    
    Pull scheduler fix from Borislav Petkov:
     "Make sure the membarrier-rseq fence commands are part of the reported
      set when querying membarrier(2) commands through MEMBARRIER_CMD_QUERY"
    
    * tag 'sched_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
      sched/membarrier: Fix membarrier-rseq fence command missing from query bitmask
    torvalds committed Jan 30, 2022
  5. Merge tag 'x86_urgent_for_v5.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
    
    Pull x86 fixes from Borislav Petkov:
    
     - Add another Intel CPU model to the list of CPUs supporting the
       processor inventory unique number
    
     - Allow writing to MCE thresholding sysfs files again - a previous
       change had accidentally disabled it and no one noticed. Goes to show
       how much this stuff is used
    
    * tag 'x86_urgent_for_v5.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
      x86/cpu: Add Xeon Icelake-D to list of CPUs that support PPIN
      x86/MCE/AMD: Allow thresholding interface updates after init
    torvalds committed Jan 30, 2022
  6. Merge branch 'akpm' (patches from Andrew)

    Merge misc fixes from Andrew Morton:
     "12 patches.
    
      Subsystems affected by this patch series: sysctl, binfmt, ia64, mm
      (memory-failure, folios, kasan, and psi), selftests, and ocfs2"
    
    * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
      ocfs2: fix a deadlock when commit trans
      jbd2: export jbd2_journal_[grab|put]_journal_head
      psi: fix "defined but not used" warnings when CONFIG_PROC_FS=n
      psi: fix "no previous prototype" warnings when CONFIG_CGROUPS=n
      mm, kasan: use compare-exchange operation to set KASAN page tag
      kasan: test: fix compatibility with FORTIFY_SOURCE
      tools/testing/scatterlist: add missing defines
      mm: page->mapping folio->mapping should have the same offset
      memory-failure: fetch compound_head after pgmap_pfn_valid()
      ia64: make IA64_MCA_RECOVERY bool instead of tristate
      binfmt_misc: fix crash when load/unload module
      include/linux/sysctl.h: fix register_sysctl_mount_point() return type
    torvalds committed Jan 30, 2022
  7. ocfs2: fix a deadlock when commit trans

    commit 6f1b228 introduces a regression which can deadlock as
    follows:
    
      Task1:                              Task2:
      jbd2_journal_commit_transaction     ocfs2_test_bg_bit_allocatable
      spin_lock(&jh->b_state_lock)        jbd_lock_bh_journal_head
      __jbd2_journal_remove_checkpoint    spin_lock(&jh->b_state_lock)
      jbd2_journal_put_journal_head
      jbd_lock_bh_journal_head
    
    Task1 and Task2 lock bh->b_state and jh->b_state_lock in different
    orders, which finally results in a deadlock.
    
    So use jbd2_journal_[grab|put]_journal_head instead in
    ocfs2_test_bg_bit_allocatable() to fix it.
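
    The commit resolves the ABBA inversion above by switching ocfs2 to the
    refcount-based jbd2 helpers, but the general remedy for this class of bug
    is to make every path take the two locks in one fixed order. A minimal,
    hypothetical userspace sketch of that ordering discipline (toy lock type
    and helper names are invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy lock: nonzero means held. */
typedef struct { int held; } toy_lock_t;

static bool toy_trylock(toy_lock_t *l)
{
    if (l->held)
        return false;
    l->held = 1;
    return true;
}

static void toy_unlock(toy_lock_t *l)
{
    l->held = 0;
}

/* Always acquire the pair in address order, so two tasks can never hold
 * the same pair in opposite orders - the ABBA deadlock becomes impossible. */
static bool toy_lock_pair(toy_lock_t *a, toy_lock_t *b)
{
    if ((uintptr_t)a > (uintptr_t)b) {
        toy_lock_t *t = a; a = b; b = t;
    }
    if (!toy_trylock(a))
        return false;
    if (!toy_trylock(b)) {
        toy_unlock(a);
        return false;
    }
    return true;
}
```

    Whichever argument order the caller uses, the locks are taken in the same
    global order, which is the invariant the two tasks above violated.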
    
    Link: https://lkml.kernel.org/r/20220121071205.100648-3-joseph.qi@linux.alibaba.com
    Fixes: 6f1b228 ("ocfs2: fix race between searching chunks and release journal_head from buffer_head")
    Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    Reported-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com>
    Tested-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com>
    Reported-by: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com>
    Cc: "Theodore Ts'o" <tytso@mit.edu>
    Cc: Andreas Dilger <adilger.kernel@dilger.ca>
    Cc: Changwei Ge <gechangwei@live.cn>
    Cc: Gang He <ghe@suse.com>
    Cc: Joel Becker <jlbec@evilplan.org>
    Cc: Jun Piao <piaojun@huawei.com>
    Cc: Junxiao Bi <junxiao.bi@oracle.com>
    Cc: Mark Fasheh <mark@fasheh.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    josephhz authored and torvalds committed Jan 30, 2022
  8. jbd2: export jbd2_journal_[grab|put]_journal_head

    Patch series "ocfs2: fix a deadlock case".
    
    This fixes a deadlock case in ocfs2.  We first export the jbd2 symbols
    jbd2_journal_[grab|put]_journal_head as preparation and later use them
    in ocfs2 instead of jbd_[lock|unlock]_bh_journal_head to fix the
    deadlock.
    
    This patch (of 2):
    
    This exports the symbols jbd2_journal_[grab|put]_journal_head, which will
    be used outside jbd2, e.g.  by ocfs2.
    
    Link: https://lkml.kernel.org/r/20220121071205.100648-2-joseph.qi@linux.alibaba.com
    Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    Cc: Mark Fasheh <mark@fasheh.com>
    Cc: Joel Becker <jlbec@evilplan.org>
    Cc: Junxiao Bi <junxiao.bi@oracle.com>
    Cc: Changwei Ge <gechangwei@live.cn>
    Cc: Gang He <ghe@suse.com>
    Cc: Jun Piao <piaojun@huawei.com>
    Cc: Andreas Dilger <adilger.kernel@dilger.ca>
    Cc: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com>
    Cc: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com>
    Cc: "Theodore Ts'o" <tytso@mit.edu>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    josephhz authored and torvalds committed Jan 30, 2022
  9. psi: fix "defined but not used" warnings when CONFIG_PROC_FS=n

    When CONFIG_PROC_FS is disabled psi code generates the following
    warnings:
    
      kernel/sched/psi.c:1364:30: warning: 'psi_cpu_proc_ops' defined but not used [-Wunused-const-variable=]
          1364 | static const struct proc_ops psi_cpu_proc_ops = {
               |                              ^~~~~~~~~~~~~~~~
      kernel/sched/psi.c:1355:30: warning: 'psi_memory_proc_ops' defined but not used [-Wunused-const-variable=]
          1355 | static const struct proc_ops psi_memory_proc_ops = {
               |                              ^~~~~~~~~~~~~~~~~~~
      kernel/sched/psi.c:1346:30: warning: 'psi_io_proc_ops' defined but not used [-Wunused-const-variable=]
          1346 | static const struct proc_ops psi_io_proc_ops = {
               |                              ^~~~~~~~~~~~~~~
    
    Make the definitions of these structures and the related functions
    conditional on CONFIG_PROC_FS.
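
    The pattern the fix applies is to compile the proc-only objects only in
    configurations where something references them. A minimal userspace
    sketch of the same idea, with a hypothetical config macro standing in
    for CONFIG_PROC_FS:

```c
#include <assert.h>

/* Hypothetical stand-in for CONFIG_PROC_FS; flip to 0 to model =n. */
#define MY_CONFIG_PROC_FS 1

#if MY_CONFIG_PROC_FS
/* Compiled only when its user exists, so the =n build never sees an
 * unreferenced const object and -Wunused-const-variable stays quiet. */
static const int psi_proc_stub = 42;
static int psi_proc_value(void) { return psi_proc_stub; }
#else
static int psi_proc_value(void) { return 0; }
#endif
```

    With the guard in place, the =n configuration simply never defines the
    unused objects, which is exactly how the warning disappears.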
    
    Link: https://lkml.kernel.org/r/20220119223940.787748-3-surenb@google.com
    Fixes: 0e94682 ("psi: introduce psi monitor")
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Reported-by: kernel test robot <lkp@intel.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    surenbaghdasaryan authored and torvalds committed Jan 30, 2022
  10. psi: fix "no previous prototype" warnings when CONFIG_CGROUPS=n

    When CONFIG_CGROUPS is disabled psi code generates the following
    warnings:
    
      kernel/sched/psi.c:1112:21: warning: no previous prototype for 'psi_trigger_create' [-Wmissing-prototypes]
          1112 | struct psi_trigger *psi_trigger_create(struct psi_group *group,
               |                     ^~~~~~~~~~~~~~~~~~
      kernel/sched/psi.c:1182:6: warning: no previous prototype for 'psi_trigger_destroy' [-Wmissing-prototypes]
          1182 | void psi_trigger_destroy(struct psi_trigger *t)
               |      ^~~~~~~~~~~~~~~~~~~
      kernel/sched/psi.c:1249:10: warning: no previous prototype for 'psi_trigger_poll' [-Wmissing-prototypes]
          1249 | __poll_t psi_trigger_poll(void **trigger_ptr,
               |          ^~~~~~~~~~~~~~~~
    
    Change the declarations of these functions in the header to provide the
    prototypes even when they are unused.
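
    -Wmissing-prototypes fires when a non-static function definition has no
    preceding declaration in scope. The fix keeps the prototypes visible in
    the header regardless of configuration; a minimal sketch of the rule,
    with a hypothetical function in place of the psi triggers:

```c
#include <assert.h>

/* In the real fix the prototype lives in include/linux/psi.h and stays
 * visible even when CONFIG_CGROUPS=n; a local declaration stands in here.
 * The declaration preceding the definition is what silences
 * -Wmissing-prototypes. */
int my_trigger_create(int threshold);

int my_trigger_create(int threshold)
{
    return threshold > 0 ? threshold : -1;
}
```

    The definition itself is unchanged; only the declaration's visibility in
    the =n configuration differs.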
    
    Link: https://lkml.kernel.org/r/20220119223940.787748-2-surenb@google.com
    Fixes: 0e94682 ("psi: introduce psi monitor")
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Reported-by: kernel test robot <lkp@intel.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    surenbaghdasaryan authored and torvalds committed Jan 30, 2022
  11. mm, kasan: use compare-exchange operation to set KASAN page tag

    It has been reported that the tag setting operation on newly-allocated
    pages can cause the page flags to be corrupted when performed
    concurrently with other flag updates as a result of the use of
    non-atomic operations.
    
    Fix the problem by using a compare-exchange loop to update the tag.
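
    The shape of such a fix is a read-modify-write retried until no
    concurrent update intervenes. A hedged userspace sketch using C11
    atomics (the shift, mask, and field layout are invented, not the
    kernel's actual page-flags layout):

```c
#include <stdatomic.h>

#define TAG_SHIFT 8
#define TAG_MASK  (0xffUL << TAG_SHIFT)

/* Replace only the tag bits inside the flags word; if another CPU changed
 * any other flag meanwhile, the compare-exchange fails and we retry with
 * the freshly observed value instead of clobbering the concurrent update. */
static void set_tag(atomic_ulong *flags, unsigned long tag)
{
    unsigned long old = atomic_load(flags);
    unsigned long new;

    do {
        new = (old & ~TAG_MASK) | ((tag << TAG_SHIFT) & TAG_MASK);
    } while (!atomic_compare_exchange_weak(flags, &old, new));
}
```

    A plain `*flags = new` here is the non-atomic store the commit removes:
    it can silently undo a flag bit set by another CPU between the load and
    the store.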
    
    Link: https://lkml.kernel.org/r/20220120020148.1632253-1-pcc@google.com
    Link: https://linux-review.googlesource.com/id/I456b24a2b9067d93968d43b4bb3351c0cec63101
    Fixes: 2813b9c ("kasan, mm, arm64: tag non slab memory allocated via pagealloc")
    Signed-off-by: Peter Collingbourne <pcc@google.com>
    Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    pcc authored and torvalds committed Jan 30, 2022
  12. kasan: test: fix compatibility with FORTIFY_SOURCE

    With CONFIG_FORTIFY_SOURCE enabled, string functions also perform
    dynamic checks using __builtin_object_size(ptr), which panic the
    kernel when they fail.
    
    Because the KASAN test deliberately performs out-of-bounds operations,
    the kernel panics with FORTIFY_SOURCE, for example:
    
     | kernel BUG at lib/string_helpers.c:910!
     | invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
     | CPU: 1 PID: 137 Comm: kunit_try_catch Tainted: G    B             5.16.0-rc3+ #3
     | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
     | RIP: 0010:fortify_panic+0x19/0x1b
     | ...
     | Call Trace:
     |  kmalloc_oob_in_memset.cold+0x16/0x16
     |  ...
    
    Fix it by also hiding `ptr` from the optimizer, which will ensure that
    __builtin_object_size() does not return a valid size, preventing
    fortified string functions from panicking.
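
    The hiding trick is an empty asm barrier that forces the pointer through
    a register, after which the compiler can no longer prove what allocation
    it points into. A small sketch of the mechanism (the `HIDE_VAR` macro is
    a local stand-in for the kernel's OPTIMIZER_HIDE_VAR; GCC/Clang only):

```c
#include <stddef.h>

/* Empty asm that makes the compiler forget everything it inferred about p,
 * including the size of the object behind it. */
#define HIDE_VAR(p) __asm__ volatile("" : "+r"(p))

static size_t hidden_size(void)
{
    char buf[16];
    char *p = buf;

    HIDE_VAR(p);
    /* After hiding, the object size is unknown: (size_t)-1. */
    return __builtin_object_size(p, 0);
}

static size_t visible_size(void)
{
    char buf[16];
    char *p = buf;

    /* Without hiding, the compiler may still see the 16-byte array
     * (depending on optimization level). */
    return __builtin_object_size(p, 0);
}
```

    With the size unknown, fortified string routines fall back to the
    unchecked path, so KASAN's deliberate out-of-bounds accesses reach KASAN
    instead of fortify_panic().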
    
    Link: https://lkml.kernel.org/r/20220124160744.1244685-1-elver@google.com
    Signed-off-by: Marco Elver <elver@google.com>
    Reported-by: Nico Pache <npache@redhat.com>
    Reviewed-by: Nico Pache <npache@redhat.com>
    Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Brendan Higgins <brendanhiggins@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    melver authored and torvalds committed Jan 30, 2022
  13. tools/testing/scatterlist: add missing defines

    The cited commits replaced preemptible with pagefault_disabled and
    flush_kernel_dcache_page with flush_dcache_page respectively, so the
    corresponding defines in the test need to be updated as well.
    
      scatterlist.c: In function ‘sg_miter_stop’:
      scatterlist.c:919:4: warning: implicit declaration of function ‘flush_dcache_page’ [-Wimplicit-function-declaration]
          flush_dcache_page(miter->page);
          ^~~~~~~~~~~~~~~~~
      In file included from linux/scatterlist.h:8:0,
                       from scatterlist.c:9:
      scatterlist.c:922:18: warning: implicit declaration of function ‘pagefault_disabled’ [-Wimplicit-function-declaration]
          WARN_ON_ONCE(!pagefault_disabled());
                        ^
      linux/mm.h:23:25: note: in definition of macro ‘WARN_ON_ONCE’
        int __ret_warn_on = !!(condition);                      \
                               ^~~~~~~~~
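
    Harnesses under tools/testing/ compile kernel sources in userspace by
    stubbing out kernel-only helpers with no-op defines. A hypothetical
    sketch of the kind of defines this fix adds (the stub bodies here are
    illustrative, not the actual tools/testing headers):

```c
/* Userspace stand-ins for kernel-only helpers, so shared code compiles.
 * flush_dcache_page() is a cache-maintenance no-op in userspace; the test
 * environment never takes page faults inside the mapping, so
 * pagefault_disabled() can simply report true. */
#define flush_dcache_page(page) do { (void)(page); } while (0)
#define pagefault_disabled() (1)

static int miter_stop_checks_pass(void)
{
    flush_dcache_page(0);
    return pagefault_disabled();
}
```

    Without these defines the calls are implicit declarations, which is
    exactly the pair of warnings quoted above.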
    
    Link: https://lkml.kernel.org/r/20220118082105.1737320-1-maorg@nvidia.com
    Fixes: 723aca2 ("mm/scatterlist: replace the !preemptible warning in sg_miter_stop()")
    Fixes: 0e84f5d ("scatterlist: replace flush_kernel_dcache_page with flush_dcache_page")
    Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
    Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    maorgottlieb authored and torvalds committed Jan 30, 2022
  14. mm: page->mapping folio->mapping should have the same offset

    As with the other members of folio, the offset of page->mapping and
    folio->mapping must be the same.  The compile-time check was
    inadvertently removed during development.  Add it back.
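
    Because struct folio overlays struct page, the aliasing only works if
    matching members sit at identical offsets, which a compile-time assert
    can pin down. A miniature version of that check with toy structs (the
    real one compares struct page and struct folio in mm.h):

```c
#include <stddef.h>

/* Toy versions of the layout rule: each folio field must sit at the same
 * offset as the page field it aliases. */
struct toy_page  { unsigned long flags; void *mapping; };
struct toy_folio { unsigned long flags; void *mapping; };

/* The compile-time check the commit restores, in miniature: the build
 * fails the moment the two offsets diverge. */
_Static_assert(offsetof(struct toy_folio, mapping) ==
               offsetof(struct toy_page, mapping),
               "folio->mapping must alias page->mapping");
```

    Making the check compile-time means a layout regression is caught at
    build time rather than as silent corruption at runtime.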
    
    [willy@infradead.org: changelog redo]
    
    Link: https://lkml.kernel.org/r/20220104011734.21714-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    RichardWeiYang authored and torvalds committed Jan 30, 2022