
Commits on Jul 14, 2021

  1. userfaultfd/selftests: Enable uffd-wp for shmem/hugetlbfs

    After we added support for shmem and hugetlbfs, we can now always enable the
    uffd-wp test.
    
    Define HUGETLB_EXPECTED_IOCTLS to avoid using UFFD_API_RANGE_IOCTLS_BASIC,
    because UFFD_API_RANGE_IOCTLS_BASIC is normally a superset of capabilities,
    while the test may not satisfy them all.  E.g., when hugetlb is registered
    without minor mode, we need to explicitly remove _UFFDIO_CONTINUE.  The same
    applies to uffd-wp: we need to explicitly remove _UFFDIO_WRITEPROTECT if not
    registered with uffd-wp.
    
    For the long term, we may consider dropping the UFFD_API_* macros completely
    from the uapi/linux/userfaultfd.h header, because a kernel header update can
    otherwise easily break userspace.
    
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  2. mm/userfaultfd: Enable write protection for shmem & hugetlbfs

    We've had all the necessary changes ready for both shmem and hugetlbfs.  Turn
    on all the shmem/hugetlbfs switches for userfaultfd-wp.
    
    We can expand UFFD_API_RANGE_IOCTLS_BASIC with _UFFDIO_WRITEPROTECT too because
    all existing types now support write protection mode.
    
    Since vma_can_userfault() will be used elsewhere, move into userfaultfd_k.h.
    
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  3. mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs

    This requires the pagemap code to be able to recognize the newly introduced
    swap special pte for uffd-wp, as well as the general hugetlb case that we
    recently started to support.  It should make pagemap uffd-wp support complete.
    
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  4. hugetlb/userfaultfd: Only drop uffd-wp special pte if required

    As with shmem uffd-wp special ptes, only drop the uffd-wp special swap pte if
    unmapping an entire vma or synchronized such that faults can not race with the
    unmap operation.  This requires passing zap_flags all the way to the lowest
    level hugetlb unmap routine: __unmap_hugepage_range.
    
    In general, unmap calls originating in hugetlbfs code will pass the
    ZAP_FLAG_DROP_FILE_UFFD_WP flag as synchronization is in place to prevent
    faults.  The exception is hole punch which will first unmap without any
    synchronization.  Later when hole punch actually removes the page from the
    file, it will check to see if there was a subsequent fault and if so take the
    hugetlb fault mutex while unmapping again.  This second unmap will pass in
    ZAP_FLAG_DROP_FILE_UFFD_WP.
    
    The core justification of "whether to apply the ZAP_FLAG_DROP_FILE_UFFD_WP
    flag when unmapping a hugetlb range" is (IMHO): we should never reach a state
    where a page fault could erroneously fault in a wr-protected page-cache page
    as writable, even for an extremely short period.  That could happen if
    e.g. we pass ZAP_FLAG_DROP_FILE_UFFD_WP in hugetlbfs_punch_hole() when calling
    hugetlb_vmdelete_list(), because if a page fault triggers after that call and
    before the remove_inode_hugepages() right after it, the page cache can be
    mapped writable again in the small window, which can cause data corruption.
    
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  5. hugetlb/userfaultfd: Allow wr-protect none ptes

    Teach the hugetlbfs code to wr-protect none ptes in case page cache exists
    for that pte.  Meanwhile we also need to be able to recognize a uffd-wp marker
    pte and remove it for uffd_wp_resolve.
    
    While at it, introduce a variable "psize" to replace all references to the
    huge page size fetcher.
    
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  6. hugetlb/userfaultfd: Handle uffd-wp special pte in hugetlb pf handler

    Teach the hugetlb page fault code to understand the uffd-wp special pte.  For
    example, when seeing such a pte we need to convert any write fault into a read
    one (which is fake - we'll retry the write later if needed).  Meanwhile, for
    handle_userfault() we need to make sure we wait for the special swap pte too,
    just like a none pte.
    
    Note that we also need to teach UFFDIO_COPY about this special pte across the
    code path so that we can safely install a new page at this special pte as long
    as we know it's a stale entry.
    
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  7. mm/hugetlb: Introduce huge version of special swap pte helpers

    This is to prepare hugetlbfs to also recognize swap special ptes, just like
    the uffd-wp special swap ptes.
    
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  8. hugetlb/userfaultfd: Handle UFFDIO_WRITEPROTECT

    This starts by passing cp_flags into hugetlb_change_protection() so that
    hugetlb will be able to handle MM_CP_UFFD_WP[_RESOLVE] requests.
    
    huge_pte_clear_uffd_wp() is introduced to handle the case where the
    UFFDIO_WRITEPROTECT is requested upon migrating huge page entries.
    
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  9. hugetlb/userfaultfd: Take care of UFFDIO_COPY_MODE_WP

    Firstly, pass the wp_copy variable into hugetlb_mcopy_atomic_pte() throughout
    the stack.  Then, apply the UFFD_WP bit if UFFDIO_COPY_MODE_WP is set for
    UFFDIO_COPY.  Introduce huge_pte_mkuffd_wp() for it.
    
    Hugetlb pages are only managed by hugetlbfs, so we're safe even without
    setting the dirty bit in the huge pte if the page is installed read-only.
    However we'd better still keep the dirty bit set for a read-only UFFDIO_COPY
    pte (when the UFFDIO_COPY_MODE_WP bit is set), not only to match what we do
    with shmem, but also because the page does contain dirty data that the kernel
    just copied from userspace.
    
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  10. hugetlb/userfaultfd: Hook page faults for uffd write protection

    Hook up hugetlbfs_fault() with the capability to handle userfaultfd-wp faults.
    
    We do this slightly earlier than hugetlb_cow() so that we can avoid taking some
    extra locks that we definitely don't need.
    
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  11. mm/hugetlb: Introduce huge pte version of uffd-wp helpers

    They will be used in the follow-up patches to check/set/clear the uffd-wp bit
    of a huge pte.
    
    So far they reuse all the small pte helpers.  Architectures can override these
    versions when necessary (with the __HAVE_ARCH_HUGE_PTE_UFFD_WP* macros) in
    the future.
    
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  12. mm/hugetlb: Drop __unmap_hugepage_range definition from hugetlb.h

    Drop it from the header since it's only used in hugetlb.c.
    
    Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  13. shmem/userfaultfd: Pass over uffd-wp special swap pte when fork()

    It should be handled similarly to other uffd-wp wr-protected ptes: pass it
    over when the dst_vma has VM_UFFD_WP armed; otherwise drop it.
    
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  14. shmem/userfaultfd: Handle the left-over special swap ptes

    Note that the special uffd-wp swap pte can be left over even if the page under
    the pte got evicted.  Normally when evicting a page, we unmap the ptes by
    walking through the reverse mapping.  However we never tracked such
    information for the special swap ptes because they're not real mappings but
    just markers.  So we need to take care of the case where we see a marker that
    is actually meaningless (the page behind it got evicted).
    
    We have already taken care of that in e.g. alloc_set_pte(), where we treat the
    special swap pte as pte_none() when necessary.  However we also need to teach
    userfaultfd itself, for both UFFDIO_COPY and page fault handling, so that
    everything still works as expected.
    
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  15. shmem/userfaultfd: Allow file-backed mem to be uffd wr-protected on thps

    We don't have a "huge" version of PTE_SWP_UFFD_WP_SPECIAL; instead, when
    necessary we split the thp if the huge page was previously uffd wr-protected.
    
    However splitting the thp is not enough, because a file-backed thp is handled
    totally differently from anonymous thps - rather than doing a real split, the
    thp pmd will simply get dropped in __split_huge_pmd_locked().
    
    That is definitely not enough if e.g. a thp covers the range [0, 2M) but we
    want to wr-protect a small page residing in the [4K, 8K) range, because after
    __split_huge_pmd() returns there will be a none pmd.
    
    Here we leverage the previously introduced change_protection_prepare() macro
    to populate the pmd with a pgtable page.  Then change_pte_range() will do all
    the rest for us, e.g., install the uffd-wp swap special pte marker at any pte
    that we'd like to wr-protect, under the protection of the pgtable lock.
    
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  16. shmem/userfaultfd: Allow wr-protect none pte for file-backed mem

    File-backed memory differs from anonymous memory in that even if the pte is
    missing, the data could still reside either in the file or in the page/swap
    cache.  So when wr-protecting a pte, we need to consider none ptes too.
    
    We do that by installing the uffd-wp special swap pte as a marker.  So when
    there's a future write to the pte, the fault handler will go the special path
    to first fault-in the page as read-only, then report to userfaultfd server with
    the wr-protect message.
    
    On the other hand, when unprotecting a page, it's also possible that the pte
    got unmapped and replaced by the special uffd-wp marker.  Then we need to be
    able to recover the uffd-wp special swap pte into a none pte, so that the next
    access to the page will fault in correctly as usual, rather than sending a
    uffd-wp message.
    
    Special care needs to be taken throughout the change_protection_range()
    process.  Since we now allow userspace to wr-protect a none pte, we need to be
    able to pre-populate the page table entries when we see !anonymous &&
    MM_CP_UFFD_WP requests; otherwise change_protection_range() will always skip
    when the pgtable entry does not exist.
    
    Note that this patch only covers small pages (pte level), not transparent
    huge pages yet.  But it will be a base for thps too.
    
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  17. shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed

    File-backed memory is prone to being unmapped at any time.  It means all
    information in the pte will be dropped, including the uffd-wp flag.
    
    Since the uffd-wp info cannot be stored in page cache or swap cache, persist
    this wr-protect information by installing the special uffd-wp marker pte when
    we're going to unmap a uffd wr-protected pte.  When the pte is accessed again,
    we will know it's previously wr-protected by recognizing the special pte.
    
    Meanwhile add a new flag ZAP_FLAG_DROP_FILE_UFFD_WP for when we don't want to
    persist such information.  For example, when destroying the whole vma, or
    punching a hole in a shmem file.  For the latter, we can only drop the uffd-wp
    bit when holding the page lock.  It means the unmap_mapping_range() in
    shmem_fallocate() still requires zapping without ZAP_FLAG_DROP_FILE_UFFD_WP
    because that's still racy with the page faults.
    
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  18. mm: Introduce ZAP_FLAG_SKIP_SWAP

    Firstly, the comment in zap_pte_range() is misleading because it checks
    against details rather than check_mapping, so it doesn't match what the code
    does.
    
    Meanwhile, it's also confusing that it never explains why passing in the
    details pointer means skipping all swap entries.  A new user of zap_details
    could very possibly miss this fact unless they read all the way down to
    zap_pte_range(), because there's no comment at zap_details mentioning it at
    all, so swap entries could be erroneously skipped without being noticed.
    
    This partly reverts 3e8715f ("mm: drop zap_details::check_swap_entries"),
    but introduces the ZAP_FLAG_SKIP_SWAP flag, which means the opposite of the
    previous "details" parameter: the caller should explicitly set this flag to
    skip swap entries, otherwise swap entries will always be considered (which is
    still the major case here).
    
    Cc: Kirill A. Shutemov <kirill@shutemov.name>
    Reviewed-by: Alistair Popple <apopple@nvidia.com>
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  19. mm: Introduce zap_details.zap_flags

    Instead of trying to introduce one variable for every new zap_details field,
    let's introduce a flag so that it can start to encode true/false information.
    
    Let's start by using this flag to clean up the only existing field,
    check_mapping.  The name "check_mapping" implies a boolean, but it actually
    stores the mapping itself, just in a way that it won't be set if we don't
    want to check the mapping.
    
    To make things clearer, introduce the 1st zap flag ZAP_FLAG_CHECK_MAPPING, so
    that we only check against the mapping if this bit is set.  At the same time,
    rename check_mapping into zap_mapping and set it always.
    
    While at it, introduce another helper zap_check_mapping_skip() and use it in
    zap_pte_range() properly.
    
    Some old comments in zap_pte_range() have been removed because they were
    duplicated; now that we have the ZAP_FLAG_CHECK_MAPPING flag, this
    information is very easy to find by simply grepping for the flag.
    
    It'll also make life easier when we want to e.g. pass in zap_flags into the
    callers like unmap_mapping_pages() (instead of adding new booleans besides the
    even_cows parameter).
    
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  20. mm: Drop first_index/last_index in zap_details

    The first_index/last_index parameters in zap_details are actually only used
    in unmap_mapping_range_tree().  Meanwhile, this function is only called once,
    by unmap_mapping_pages().  Instead of passing these two variables through the
    whole stack of page zapping code, remove them from zap_details and let them
    simply be parameters of unmap_mapping_range_tree(), which is inlined.
    
    Reviewed-by: Alistair Popple <apopple@nvidia.com>
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  21. shmem/userfaultfd: Handle uffd-wp special pte in page fault handler

    File-backed memory is prone to unmap/swap so its ptes are always unstable.
    This can lead to userfaultfd-wp information getting lost when such memory is
    unmapped or swapped out, for example on shmem.  To keep this information
    persistent, we will start to use the newly introduced swap-like special ptes
    to replace a none pte when those ptes are removed.
    
    Prepare this by handling such a special pte first before it is applied in the
    general page fault handler.
    
    The handling of this special pte page fault is similar to a missing fault,
    but it should happen after the pte-missing logic since the special pte is
    designed to be a swap-like pte.  Meanwhile it should be handled before
    do_swap_page() so that the swap core logic won't be confused by such an
    illegal swap pte.
    
    This is a slow path of uffd-wp handling, because unmapping of wr-protected
    shmem ptes should be rare.  So far it should only trigger in two conditions:
    
      (1) When trying to punch holes in shmem_fallocate(), there will be a
          pre-unmap optimization before evicting the page.  That will create
          unmapped shmem ptes with wr-protected pages covered.
    
      (2) Swapping out of shmem pages
    
    Because of this, the page fault handling is simplified too by not sending the
    wr-protect message on the 1st page fault; instead the page will be installed
    read-only, so the message will only be generated upon the next write, which
    will trigger the do_wp_page() path of general uffd-wp handling.
    
    Disable fault-around for all uffd-wp registered ranges for extra safety, and
    clean the code up a bit after we introduced MINOR fault.
    
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  22. mm/swap: Introduce the idea of special swap ptes

    We used to have special swap entries, like migration entries, hw-poison
    entries, device private entries, etc.
    
    Those "special swap entries" must at least be swap entries first, and their
    types are decided by swp_type(entry).
    
    This patch introduces another idea called "special swap ptes".
    
    It's very easy to confuse them with "special swap entries", but a special
    swap pte should never contain a swap entry at all.  That means it's illegal
    to call pte_to_swp_entry() on a special swap pte.
    
    Make the uffd-wp special pte to be the first special swap pte.
    
    Before this patch, is_swap_pte()==true means one of the below:
    
       (a.1) The pte has a normal swap entry (non_swap_entry()==false).  For
             example, when an anonymous page got swapped out.
    
       (a.2) The pte has a special swap entry (non_swap_entry()==true).  For
             example, a migration entry, a hw-poison entry, etc.
    
    After this patch, is_swap_pte()==true means one of the below, where case (b) is
    added:
    
     (a) The pte contains a swap entry.
    
       (a.1) The pte has a normal swap entry (non_swap_entry()==false).  For
             example, when an anonymous page got swapped out.
    
       (a.2) The pte has a special swap entry (non_swap_entry()==true).  For
             example, a migration entry, a hw-poison entry, etc.
    
     (b) The pte does not contain a swap entry at all (so it cannot be passed
         into pte_to_swp_entry()).  For example, uffd-wp special swap pte.
    
    Teach the whole mm core about this new idea.  It's done by introducing
    another helper called pte_has_swap_entry() which stands for cases (a.1) and
    (a.2).  Before this patch, it is the same as is_swap_pte() because there's no
    special swap pte yet.  Now most of the previous uses of is_swap_pte() in mm
    core will need to use the new helper pte_has_swap_entry() instead, to make
    sure we won't try to parse a swap entry from a swap special pte (which does
    not contain a swap entry at all!).  We either handle the swap special pte, or
    it'll naturally fall through to the default "else" paths.
    
    Warn properly (e.g., in do_swap_page()) when we see a special swap pte - we
    should never call do_swap_page() on those ptes, so just bail out early if it
    happens.
    
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  23. mm/userfaultfd: Introduce special pte for unmapped file-backed mem

    This patch introduces a very special swap-like pte for file-backed memory.
    
    Currently it's only defined for x86_64, but any arch that can properly define
    the UFFD_WP_SWP_PTE_SPECIAL value as requested should conceptually work too.
    
    We will use this special pte to arm the ptes that got either unmapped or
    swapped out for a file-backed region that was previously wr-protected.  This
    special pte can trigger a page fault just like swap entries do, because it
    satisfies pte_none()==false && pte_present()==false.
    
    Then we can revive the special pte into a normal pte backed by the page cache.
    
    This idea is greatly inspired by Hugh and Andrea in the discussion, which is
    referenced in the links below.
    
    The other idea (from Hugh) is to use swp_type==1 and swp_offset=0 as the
    special pte.  The current solution (as pointed out by Andrea) is slightly
    preferred in that we don't need any swp_entry_t knowledge at all to trap
    these accesses.  Meanwhile, we also reuse _PAGE_SWP_UFFD_WP from the
    anonymous swp entries.
    
    This patch only introduces the special pte and its operators.  It's not yet
    applied to have any functional difference.
    
    Link: https://lore.kernel.org/lkml/20201126222359.8120-1-peterx@redhat.com/
    Link: https://lore.kernel.org/lkml/20201130230603.46187-1-peterx@redhat.com/
    Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
    Suggested-by: Hugh Dickins <hughd@google.com>
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  24. mm: Clear vmf->pte after pte_unmap_same() returns

    pte_unmap_same() will always unmap the pte pointer.  After the unmap, vmf->pte
    will not be valid any more.  We should clear it.
    
    It was safe only because no one is accessing vmf->pte after pte_unmap_same()
    returns, since the only caller of pte_unmap_same() (so far) is do_swap_page(),
    where vmf->pte will in most cases be overwritten very soon.
    
    pte_unmap_same() will be used in other places in follow-up patches, where
    vmf->pte will not always be re-written.  This patch enables us to call
    functions like finish_fault(), which conditionally unmaps the pte by checking
    vmf->pte first.  Also, alloc_set_pte() will make sure to allocate a new pte
    even after pte_unmap_same() has been called.
    
    Since we'll need to modify vmf->pte, pass vmf directly into pte_unmap_same(),
    which also avoids the long parameter list.
    
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  25. shmem/userfaultfd: Take care of UFFDIO_COPY_MODE_WP

    Firstly, pass wp_copy into shmem_mfill_atomic_pte() through the stack.  Then
    apply the UFFD_WP bit properly when the UFFDIO_COPY on shmem is done with
    UFFDIO_COPY_MODE_WP; wp_copy then lands in mfill_atomic_install_pte(), which
    was introduced very recently.
    
    We need to make sure shmem_mfill_atomic_pte() always sets the dirty bit in
    the pte even if UFFDIO_COPY_MODE_WP is set.  After the rework of the minor
    fault series on shmem we need to slightly touch up the logic there, since
    uffd-wp needs to be applied even if writable==false (e.g., for a shmem
    private mapping).
    
    Note: we must do pte_wrprotect() if !writable in mfill_atomic_install_pte(), as
    mk_pte() could return a writable pte (e.g., when VM_SHARED on a shmem file).
    
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  26. mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte

    It was done conditionally before, because there's one shmem special case
    where we use SetPageDirty() instead.  However that's not necessary, and it's
    easier and cleaner to do it unconditionally in mfill_atomic_install_pte().
    
    The most recent discussion about this is here, where Hugh explained the history
    of SetPageDirty() and why it's possible that it's not required at all:
    
    https://lore.kernel.org/lkml/alpine.LSU.2.11.2104121657050.1097@eggly.anvils/
    
    Currently mfill_atomic_install_pte() has three callers:
    
            1. shmem_mfill_atomic_pte
            2. mcopy_atomic_pte
            3. mcontinue_atomic_pte
    
    After the change: case (1) has its SetPageDirty() replaced by the dirty bit
    on the pte (so we finally unify them), case (2) has no functional change at
    all since it has page_in_cache==false, and case (3) may add a dirty bit to
    the pte.  However, since case (3) is UFFDIO_CONTINUE for shmem, the page is
    almost certainly dirty after all, so it should not make a real difference
    either.
    
    This should make it much easier to follow which cases set dirty for uffd, as
    we now simply set it for all uffd-related ioctls.  Meanwhile, there's no
    special handling of SetPageDirty() when it's not needed.
    
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Signed-off-by: Peter Xu <peterx@redhat.com>
    xzpeter authored and intel-lab-lkp committed Jul 14, 2021
  27. Merge tag 'net-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
    
    Pull networking fixes from Jakub Kicinski.
     "Including fixes from bpf and netfilter.
    
      Current release - regressions:
    
       - sock: fix parameter order in sock_setsockopt()
    
      Current release - new code bugs:
    
       - netfilter: nft_last:
           - fix incorrect arithmetic when restoring last used
           - honor NFTA_LAST_SET on restoration
    
      Previous releases - regressions:
    
       - udp: properly flush normal packet at GRO time
    
       - sfc: ensure correct number of XDP queues; don't allow enabling the
         feature if there isn't sufficient resources to Tx from any CPU
    
       - dsa: sja1105: fix address learning getting disabled on the CPU port
    
       - mptcp: addresses a rmem accounting issue that could keep packets in
         subflow receive buffers longer than necessary, delaying MPTCP-level
         ACKs
    
       - ip_tunnel: fix mtu calculation for ETHER tunnel devices
    
       - do not reuse skbs allocated from skbuff_fclone_cache in the napi
         skb cache, we'd try to return them to the wrong slab cache
    
       - tcp: consistently disable header prediction for mptcp
    
      Previous releases - always broken:
    
       - bpf: fix subprog poke descriptor tracking use-after-free
    
       - ipv6:
           - allocate enough headroom in ip6_finish_output2() in case
             iptables TEE is used
           - tcp: drop silly ICMPv6 packet too big messages to avoid
             expensive and pointless lookups (which may serve as a DDOS
             vector)
           - make sure fwmark is copied in SYNACK packets
           - fix 'disable_policy' for forwarded packets (align with IPv4)
    
       - netfilter: conntrack:
           - do not renew entry stuck in tcp SYN_SENT state
           - do not mark RST in the reply direction coming after SYN packet
             for an out-of-sync entry
    
       - mptcp: cleanly handle error conditions with MP_JOIN and syncookies
    
       - mptcp: fix double free when rejecting a join due to port mismatch
    
       - validate lwtstate->data before returning from skb_tunnel_info()
    
       - tcp: call sk_wmem_schedule before sk_mem_charge in zerocopy path
    
       - mt76: mt7921: continue to probe driver when fw already downloaded
    
       - bonding: fix multiple issues with offloading IPsec to (thru?) bond
    
       - stmmac: ptp: fix issues around Qbv support and setting time back
    
       - bcmgenet: always clear wake-up based on energy detection
    
      Misc:
    
       - sctp: move 198 addresses from unusable to private scope
    
       - ptp: support virtual clocks and timestamping
    
       - openvswitch: optimize operation for key comparison"
    
    * tag 'net-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (158 commits)
      net: dsa: properly check for the bridge_leave methods in dsa_switch_bridge_leave()
      sfc: add logs explaining XDP_TX/REDIRECT is not available
      sfc: ensure correct number of XDP queues
      sfc: fix lack of XDP TX queues - error XDP TX failed (-22)
      net: fddi: fix UAF in fza_probe
      net: dsa: sja1105: fix address learning getting disabled on the CPU port
      net: ocelot: fix switchdev objects synced for wrong netdev with LAG offload
      net: Use nlmsg_unicast() instead of netlink_unicast()
      octeontx2-pf: Fix uninitialized boolean variable pps
      ipv6: allocate enough headroom in ip6_finish_output2()
      net: hdlc: rename 'mod_init' & 'mod_exit' functions to be module-specific
      net: bridge: multicast: fix MRD advertisement router port marking race
      net: bridge: multicast: fix PIM hello router port marking race
      net: phy: marvell10g: fix differentiation of 88X3310 from 88X3340
      dsa: fix for_each_child.cocci warnings
      virtio_net: check virtqueue_add_sgs() return value
      mptcp: properly account bulk freed memory
      selftests: mptcp: fix case multiple subflows limited by server
      mptcp: avoid processing packet if a subflow reset
      mptcp: fix syncookie process if mptcp can not_accept new subflow
      ...
    torvalds committed Jul 14, 2021
  28. fs: add vfs_parse_fs_param_source() helper

    Add a simple helper that filesystems can use in their parameter parser
    to parse the "source" parameter. A few places open-coded this function
    and that already caused a bug in the cgroup v1 parser that we fixed.
    Let's make it harder to get this wrong by introducing a helper which
    performs all necessary checks.
    
    Link: https://syzkaller.appspot.com/bug?id=6312526aba5beae046fdae8f00399f87aab48b12
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    brauner authored and torvalds committed Jul 14, 2021
  29. cgroup: verify that source is a string

    The following sequence can be used to trigger a UAF:
    
        int fscontext_fd = fsopen("cgroup");
        int fd_null = open("/dev/null", O_RDONLY);
        fsconfig(fscontext_fd, FSCONFIG_SET_FD, "source", fd_null);
        close_range(3, ~0U, 0);
    
    The cgroup v1 specific fs parser expects a string for the "source"
    parameter.  However, it is perfectly legitimate to e.g. specify a file
    descriptor for the "source" parameter.  The fs parser doesn't know what a
    filesystem allows there.  So it's a bug to assume that "source" is always of
    type fs_value_is_string when it can reasonably also be fs_value_is_file.
    
    This assumption in the cgroup code causes a UAF because struct fs_parameter
    uses a union for the actual value.  Access to that union is guarded by the
    param->type member.  Since the cgroup parameter parser didn't check
    param->type but unconditionally moved param->string into fc->source, a close
    on the fscontext_fd would trigger a UAF during put_fs_context(), which frees
    fc->source, thereby freeing the file stashed in param->file and causing a
    UAF during the close of fd_null.
    
    Fix this by verifying that param->type is actually a string and report
    an error if not.
    
    In follow-up patches I'll add a new generic helper that can be used
    here and by other filesystems instead of this error-prone copy-pasta
    fix.  But fixing it here first makes backporting it to stable a lot
    easier.
    
    Fixes: 8d2451f ("cgroup1: switch to option-by-option parsing")
    Reported-by: syzbot+283ce5a46486d6acdbaf@syzkaller.appspotmail.com
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: <stable@kernel.org>
    Cc: syzkaller-bugs <syzkaller-bugs@googlegroups.com>
    Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    brauner authored and torvalds committed Jul 14, 2021

Commits on Jul 13, 2021

  1. net: dsa: properly check for the bridge_leave methods in dsa_switch_b…

    …ridge_leave()
    
    This was not caught because there is no switch driver which implements
    the .port_bridge_join but not the .port_bridge_leave method, but it
    should nonetheless be fixed, as in certain conditions (driver
    development) it might lead to a NULL pointer dereference.
    
    Fixes: f66a6a6 ("net: dsa: permit cross-chip bridging between all trees in the system")
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    vladimiroltean authored and davem330 committed Jul 13, 2021
  2. Merge tag 'vboxsf-v5.14-1' of git://git.kernel.org/pub/scm/linux/kern…

    …el/git/hansg/linux
    
    Pull vboxsf fixes from Hans de Goede:
     "This adds support for the atomic_open directory-inode op to vboxsf.
    
      Note this is not just an enhancement, it also fixes an actual issue
      which users are hitting; see the commit message of the "vboxsf: Add
      support for the atomic_open directory-inode" patch"
    
    * tag 'vboxsf-v5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hansg/linux:
      vboxsf: Add support for the atomic_open directory-inode op
      vboxsf: Add vboxsf_[create|release]_sf_handle() helpers
      vboxsf: Make vboxsf_dir_create() return the handle for the created file
      vboxsf: Honor excl flag to the dir-inode create op
    torvalds committed Jul 13, 2021
  3. Merge tag 'for-5.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/ke…

    …rnel/git/kdave/linux
    
    Pull btrfs zoned mode fixes from David Sterba:
    
     - fix deadlock when allocating system chunk
    
     - fix wrong mutex unlock on an error path
    
     - fix extent map splitting for append operation
    
     - update and fix message reporting unusable chunk space
    
     - don't block when background zone reclaim runs with balance in
       parallel
    
    * tag 'for-5.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
      btrfs: zoned: fix wrong mutex unlock on failure to allocate log root tree
      btrfs: don't block if we can't acquire the reclaim lock
      btrfs: properly split extent_map for REQ_OP_ZONE_APPEND
      btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
      btrfs: fix deadlock with concurrent chunk allocations involving system chunks
      btrfs: zoned: print unusable percentage when reclaiming block groups
      btrfs: zoned: fix types for u64 division in btrfs_reclaim_bgs_work
    torvalds committed Jul 13, 2021
  4. Merge branch 'sfc-tx-queues'

    Íñigo Huguet says:
    
    ====================
    sfc: Fix lack of XDP TX queues
    
    A change introduced in commit e26ca4b ("sfc: reduce the number of
    requested xdp ev queues") created a bug in XDP_TX and XDP_REDIRECT
    because it unintentionally reduced the number of XDP TX queues,
    leaving too few queues to have one per CPU, which led to errors if
    XDP TX/REDIRECT was done from a high-numbered CPU.
    
    This patch series makes the following changes:
    - Fix the bug mentioned above
    - Revert commit 99ba0ea ("sfc: adjust efx->xdp_tx_queue_count with
      the real number of initialized queues"), which intended to fix a
      related problem created by the mentioned bug but is no longer
      necessary
    - Add a new error log message if there are not enough resources to
      make XDP_TX/REDIRECT work
    
    V1 -> V2: keep the calculation of how many tx queues can handle a single
    event queue, but apply the "max. tx queues per channel" upper limit.
    V2 -> V3: WARN_ON if the number of initialized XDP TXQs differs from
    the expected value.
    ====================
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
    davem330 committed Jul 13, 2021
  5. sfc: add logs explaining XDP_TX/REDIRECT is not available

    If it's not possible to allocate enough channels for XDP, XDP_TX and
    XDP_REDIRECT don't work.  However, only a message saying that not
    enough channels were available was shown, without explaining the
    consequences.  Users couldn't tell whether XDP was usable at all,
    whether performance would be reduced, or what else was affected.
    
    Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Íñigo Huguet authored and davem330 committed Jul 13, 2021
  6. sfc: ensure correct number of XDP queues

    Commit 99ba0ea ("sfc: adjust efx->xdp_tx_queue_count with the real
    number of initialized queues") intended to fix a problem caused by a
    round up when calculating the number of XDP channels and queues.
    However, this was not the real problem. The real problem was that the
    number of XDP TX queues had been reduced to half in
    commit e26ca4b ("sfc: reduce the number of requested xdp ev queues"),
    but the variable xdp_tx_queue_count had remained the same.
    
    Once the correct number of XDP TX queues is created again by the
    previous patch of this series, that commit can also be reverted,
    since the error it worked around no longer exists.
    
    xdp_queue_number and efx->xdp_tx_queue_count can only differ if there
    is a bug in the code.  Because of this, and per Edward Cree's
    suggestion, I add a WARN_ON instead to catch it if it happens again
    in the future.
    
    Note that the number of allocated queues can be higher than the number
    of used ones due to the round up, as explained in the existing comment
    in the code. That's why we also have to stop increasing xdp_queue_number
    beyond efx->xdp_tx_queue_count.
    
    Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Íñigo Huguet authored and davem330 committed Jul 13, 2021