mm: Device exclusive memory access
Some devices require exclusive write access to shared virtual
memory (SVM) ranges to perform atomic operations on that memory. This
requires CPU page tables to be updated to deny access whilst atomic
operations are occurring.

In order to do this, introduce a new swap entry
type (SWP_DEVICE_EXCLUSIVE). When an SVM range needs to be marked for
exclusive access by a device, all page table mappings for the particular
range are replaced with device exclusive swap entries. This causes any
CPU access to the page to result in a fault.

Faults are resolved by replacing the faulting entry with the original
mapping. This results in MMU notifiers being called, which a driver uses
to update access permissions such as revoking atomic access. After
notifiers have been called the device will no longer have exclusive
access to the region.

Walking of the page tables to find the target pages is handled by
get_user_pages() rather than a direct page table walk. A direct page
table walk similar to what migrate_vma_collect()/unmap() does could also
have been used. However, that approach would have duplicated functionality
that get_user_pages() already provides, since page faulting is required to
make the PTEs present and to break COW.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
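
For illustration only (not part of this patch), the sketch below shows roughly how a driver might drive the new interface when servicing a device atomic fault. Only make_device_exclusive_range() and the page lock/reference semantics come from this series; the my_dev structure and my_dev_install_atomic_pte() are hypothetical stand-ins, and the exact return-value convention is an assumption.

/*
 * Hypothetical driver-side sketch, assuming the driver already has an
 * mmu_interval_notifier covering the range and a device-specific way to
 * program an atomic-capable device PTE.
 */
static int my_dev_map_atomic(struct my_dev *dev, struct mm_struct *mm,
                             unsigned long addr)
{
        unsigned long start = addr & PAGE_MASK;
        struct page *page = NULL;
        int ret;

        mmap_read_lock(mm);
        /* Replace the CPU mapping with a device exclusive swap entry. */
        ret = make_device_exclusive_range(mm, start, start + PAGE_SIZE,
                                          &page, dev);
        mmap_read_unlock(mm);
        if (ret <= 0 || !page)
                return -EBUSY;

        /* Hypothetical hook: program the device PTE with the atomic bit. */
        ret = my_dev_install_atomic_pte(dev, addr, page_to_pfn(page));

        /*
         * Exclusive access is only guaranteed while the page lock and
         * reference taken by make_device_exclusive_range() are held.
         */
        unlock_page(page);
        put_page(page);
        return ret;
}

Any later CPU fault on the address restores the original mapping and raises an MMU_NOTIFY_EXCLUSIVE notification with a NULL owner, at which point the driver must treat its atomic mapping as revoked.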
apopple-nvidia authored and intel-lab-lkp committed Jun 7, 2021
1 parent 9727d64 commit e541984
Showing 10 changed files with 401 additions and 10 deletions.
17 changes: 17 additions & 0 deletions Documentation/vm/hmm.rst
@@ -405,6 +405,23 @@ between device driver specific code and shared common code:

The lock can now be released.

Exclusive access memory
=======================

Some devices have features such as atomic PTE bits that can be used to implement
atomic access to system memory. To support atomic operations on a shared virtual
memory page, such a device needs access to that page that is exclusive of any
userspace access from the CPU. The ``make_device_exclusive_range()`` function
can be used to make a memory range inaccessible from userspace.

This replaces all mappings for pages in the given range with special swap
entries. Any attempt to access the swap entry results in a fault which is
resolved by replacing the entry with the original mapping. A driver gets
notified that the mapping has been changed by MMU notifiers, after which point
it will no longer have exclusive access to the page. Exclusive access is
guaranteed to last until the driver drops the page lock and page reference, at
which point any CPU faults on the page may proceed as described.

Memory cgroup (memcg) and rss accounting
========================================

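To make the notifier interaction described in the documentation above concrete, here is a sketch (not part of the patch) of what a driver's interval-notifier callback might look like. The my_dev_notifier structure and my_dev_revoke_atomic_access() are hypothetical; mmu_interval_notifier_ops, MMU_NOTIFY_EXCLUSIVE and the range->owner field (introduced earlier in this series) are real. The owner check lets the driver ignore the notification generated while it is itself creating the exclusive entry.

static bool my_dev_interval_invalidate(struct mmu_interval_notifier *mni,
                                       const struct mmu_notifier_range *range,
                                       unsigned long cur_seq)
{
        struct my_dev_notifier *n =
                container_of(mni, struct my_dev_notifier, notifier);

        /*
         * The MMU_NOTIFY_EXCLUSIVE event raised by our own call to
         * make_device_exclusive_range() carries our owner pointer and can
         * be ignored; everything else revokes device access.
         */
        if (range->event == MMU_NOTIFY_EXCLUSIVE && range->owner == n->dev)
                return true;

        if (!mmu_notifier_range_blockable(range))
                return false;

        mutex_lock(&n->lock);
        mmu_interval_set_seq(mni, cur_seq);
        /* Hypothetical hook: tear down the device's atomic mapping. */
        my_dev_revoke_atomic_access(n->dev, range->start, range->end);
        mutex_unlock(&n->lock);
        return true;
}

static const struct mmu_interval_notifier_ops my_dev_mni_ops = {
        .invalidate = my_dev_interval_invalidate,
};
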
6 changes: 6 additions & 0 deletions include/linux/mmu_notifier.h
@@ -42,6 +42,11 @@ struct mmu_interval_notifier;
* @MMU_NOTIFY_MIGRATE: used during migrate_vma_collect() invalidate to signal
* a device driver to possibly ignore the invalidation if the
* owner field matches the driver's device private pgmap owner.
*
* @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
* longer have exclusive access to the page. When sent during creation of an
* exclusive range the owner will be initialised to the value provided by the
* caller of make_device_exclusive_range(), otherwise the owner will be NULL.
*/
enum mmu_notifier_event {
MMU_NOTIFY_UNMAP = 0,
@@ -51,6 +56,7 @@ enum mmu_notifier_event {
MMU_NOTIFY_SOFT_DIRTY,
MMU_NOTIFY_RELEASE,
MMU_NOTIFY_MIGRATE,
MMU_NOTIFY_EXCLUSIVE,
};

#define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
4 changes: 4 additions & 0 deletions include/linux/rmap.h
@@ -193,6 +193,10 @@ int page_referenced(struct page *, int is_locked,
bool try_to_migrate(struct page *page, enum ttu_flags flags);
bool try_to_unmap(struct page *, enum ttu_flags flags);

int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
unsigned long end, struct page **pages,
void *arg);

/* Avoid racy checks */
#define PVMW_SYNC (1 << 0)
/* Look for migration entries rather than present PTEs */
9 changes: 7 additions & 2 deletions include/linux/swap.h
@@ -62,12 +62,17 @@ static inline int current_is_kswapd(void)
* migrate part of a process memory to device memory.
*
* When a page is migrated from CPU to device, we set the CPU page table entry
* to a special SWP_DEVICE_* entry.
* to a special SWP_DEVICE_{READ|WRITE} entry.
*
* When a page is mapped by the device for exclusive access we set the CPU page
* table entries to special SWP_DEVICE_EXCLUSIVE_* entries.
*/
#ifdef CONFIG_DEVICE_PRIVATE
#define SWP_DEVICE_NUM 2
#define SWP_DEVICE_NUM 4
#define SWP_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM)
#define SWP_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+1)
#define SWP_DEVICE_EXCLUSIVE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+2)
#define SWP_DEVICE_EXCLUSIVE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+3)
#else
#define SWP_DEVICE_NUM 0
#endif
44 changes: 43 additions & 1 deletion include/linux/swapops.h
@@ -120,6 +120,27 @@ static inline bool is_writable_device_private_entry(swp_entry_t entry)
{
return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
}

static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset)
{
return swp_entry(SWP_DEVICE_EXCLUSIVE_READ, offset);
}

static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset)
{
return swp_entry(SWP_DEVICE_EXCLUSIVE_WRITE, offset);
}

static inline bool is_device_exclusive_entry(swp_entry_t entry)
{
return swp_type(entry) == SWP_DEVICE_EXCLUSIVE_READ ||
swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE;
}

static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
{
return unlikely(swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE);
}
#else /* CONFIG_DEVICE_PRIVATE */
static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
{
@@ -140,6 +161,26 @@ static inline bool is_writable_device_private_entry(swp_entry_t entry)
{
return false;
}

static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset)
{
return swp_entry(0, 0);
}

static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset)
{
return swp_entry(0, 0);
}

static inline bool is_device_exclusive_entry(swp_entry_t entry)
{
return false;
}

static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
{
return false;
}
#endif /* CONFIG_DEVICE_PRIVATE */

#ifdef CONFIG_MIGRATION
@@ -219,7 +260,8 @@ static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
*/
static inline bool is_pfn_swap_entry(swp_entry_t entry)
{
return is_migration_entry(entry) || is_device_private_entry(entry);
return is_migration_entry(entry) || is_device_private_entry(entry) ||
is_device_exclusive_entry(entry);
}

struct page_vma_mapped_walk;
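As a rough illustration of how the helpers above fit together (a sketch only; the real installation lives in mm/rmap.c and additionally handles the page table lock, TLB flushing, soft-dirty/uffd-wp bits and rmap accounting):

/* Encode a device exclusive entry for a page and install it as a PTE. */
static void example_install_exclusive_pte(struct vm_area_struct *vma,
                                          unsigned long addr, pte_t *ptep,
                                          struct page *page, bool writable)
{
        swp_entry_t entry;

        if (writable)
                entry = make_writable_device_exclusive_entry(page_to_pfn(page));
        else
                entry = make_readable_device_exclusive_entry(page_to_pfn(page));
        set_pte_at(vma->vm_mm, addr, ptep, swp_entry_to_pte(entry));
}

/* Later, a page table walker can recognise such an entry again. */
static bool example_is_exclusive_pte(pte_t pte)
{
        return is_swap_pte(pte) &&
               is_device_exclusive_entry(pte_to_swp_entry(pte));
}
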
5 changes: 5 additions & 0 deletions mm/hmm.c
@@ -26,6 +26,8 @@
#include <linux/mmu_notifier.h>
#include <linux/memory_hotplug.h>

#include "internal.h"

struct hmm_vma_walk {
struct hmm_range *range;
unsigned long last;
@@ -271,6 +273,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
if (!non_swap_entry(entry))
goto fault;

if (is_device_exclusive_entry(entry))
goto fault;

if (is_migration_entry(entry)) {
pte_unmap(ptep);
hmm_vma_walk->last = addr;
127 changes: 123 additions & 4 deletions mm/memory.c
@@ -700,6 +700,68 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
}
#endif

static void restore_exclusive_pte(struct vm_area_struct *vma,
struct page *page, unsigned long address,
pte_t *ptep)
{
pte_t pte;
swp_entry_t entry;

pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
if (pte_swp_soft_dirty(*ptep))
pte = pte_mksoft_dirty(pte);

entry = pte_to_swp_entry(*ptep);
if (pte_swp_uffd_wp(*ptep))
pte = pte_mkuffd_wp(pte);
else if (is_writable_device_exclusive_entry(entry))
pte = maybe_mkwrite(pte_mkdirty(pte), vma);

set_pte_at(vma->vm_mm, address, ptep, pte);

/*
* No need to take a page reference as one was already
* created when the swap entry was made.
*/
if (PageAnon(page))
page_add_anon_rmap(page, vma, address, false);
else
/*
* Currently device exclusive access only supports anonymous
* memory so the entry shouldn't point to a filebacked page.
*/
WARN_ON_ONCE(!PageAnon(page));

if (vma->vm_flags & VM_LOCKED)
mlock_vma_page(page);

/*
* No need to invalidate - it was non-present before. However
* secondary CPUs may have mappings that need invalidating.
*/
update_mmu_cache(vma, address, ptep);
}

/*
* Tries to restore an exclusive pte if the page lock can be acquired without
* sleeping.
*/
static int
try_restore_exclusive_pte(pte_t *src_pte, struct vm_area_struct *vma,
unsigned long addr)
{
swp_entry_t entry = pte_to_swp_entry(*src_pte);
struct page *page = pfn_swap_entry_to_page(entry);

if (trylock_page(page)) {
restore_exclusive_pte(vma, page, addr, src_pte);
unlock_page(page);
return 0;
}

return -EBUSY;
}

/*
* copy one vm_area from one task to the other. Assumes the page tables
* already present in the new task to be cleared in the whole range
@@ -781,6 +843,17 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte = pte_swp_mkuffd_wp(pte);
set_pte_at(src_mm, addr, src_pte, pte);
}
} else if (is_device_exclusive_entry(entry)) {
/*
* Make device exclusive entries present by restoring the
* original entry then copying as for a present pte. Device
* exclusive entries currently only support private writable
* (ie. COW) mappings.
*/
VM_BUG_ON(!is_cow_mapping(vma->vm_flags));
if (try_restore_exclusive_pte(src_pte, vma, addr))
return -EBUSY;
return -ENOENT;
}
set_pte_at(dst_mm, addr, dst_pte, pte);
return 0;
@@ -980,9 +1053,18 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
if (ret == -EIO) {
entry = pte_to_swp_entry(*src_pte);
break;
} else if (ret == -EBUSY) {
break;
} else if (!ret) {
progress += 8;
continue;
}
progress += 8;
continue;

/*
* Device exclusive entry restored, continue by copying
* the now present pte.
*/
WARN_ON_ONCE(ret != -ENOENT);
}
/* copy_present_pte() will clear `*prealloc' if consumed */
ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
@@ -1020,6 +1102,8 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
goto out;
}
entry.val = 0;
} else if (ret == -EBUSY) {
goto out;
} else if (ret == -EAGAIN) {
prealloc = page_copy_prealloc(src_mm, src_vma, addr);
if (!prealloc)
@@ -1287,7 +1371,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
}

entry = pte_to_swp_entry(ptent);
if (is_device_private_entry(entry)) {
if (is_device_private_entry(entry) ||
is_device_exclusive_entry(entry)) {
struct page *page = pfn_swap_entry_to_page(entry);

if (unlikely(details && details->check_mapping)) {
@@ -1303,7 +1388,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,

pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
rss[mm_counter(page)]--;
page_remove_rmap(page, false);

if (is_device_private_entry(entry))
page_remove_rmap(page, false);

put_page(page);
continue;
}
@@ -3307,6 +3395,34 @@ void unmap_mapping_range(struct address_space *mapping,
}
EXPORT_SYMBOL(unmap_mapping_range);

/*
* Restore a potential device exclusive pte to a working pte entry
*/
static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
{
struct page *page = vmf->page;
struct vm_area_struct *vma = vmf->vma;
struct mmu_notifier_range range;

if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
return VM_FAULT_RETRY;
mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
vma->vm_mm, vmf->address & PAGE_MASK,
(vmf->address & PAGE_MASK) + PAGE_SIZE, NULL);
mmu_notifier_invalidate_range_start(&range);

vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
restore_exclusive_pte(vma, page, vmf->address, vmf->pte);

pte_unmap_unlock(vmf->pte, vmf->ptl);
unlock_page(page);

mmu_notifier_invalidate_range_end(&range);
return 0;
}

/*
* We enter with non-exclusive mmap_lock (to exclude vma changes,
* but allow concurrent faults), and pte mapped but not yet locked.
@@ -3334,6 +3450,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (is_migration_entry(entry)) {
migration_entry_wait(vma->vm_mm, vmf->pmd,
vmf->address);
} else if (is_device_exclusive_entry(entry)) {
vmf->page = pfn_swap_entry_to_page(entry);
ret = remove_device_exclusive_entry(vmf);
} else if (is_device_private_entry(entry)) {
vmf->page = pfn_swap_entry_to_page(entry);
ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
8 changes: 8 additions & 0 deletions mm/mprotect.c
@@ -165,6 +165,14 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
newpte = swp_entry_to_pte(entry);
if (pte_swp_uffd_wp(oldpte))
newpte = pte_swp_mkuffd_wp(newpte);
} else if (is_writable_device_exclusive_entry(entry)) {
entry = make_readable_device_exclusive_entry(
swp_offset(entry));
newpte = swp_entry_to_pte(entry);
if (pte_swp_soft_dirty(oldpte))
newpte = pte_swp_mksoft_dirty(newpte);
if (pte_swp_uffd_wp(oldpte))
newpte = pte_swp_mkuffd_wp(newpte);
} else {
newpte = oldpte;
}
9 changes: 6 additions & 3 deletions mm/page_vma_mapped.c
@@ -41,7 +41,8 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)

/* Handle un-addressable ZONE_DEVICE memory */
entry = pte_to_swp_entry(*pvmw->pte);
if (!is_device_private_entry(entry))
if (!is_device_private_entry(entry) &&
!is_device_exclusive_entry(entry))
return false;
} else if (!pte_present(*pvmw->pte))
return false;
@@ -93,7 +94,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
return false;
entry = pte_to_swp_entry(*pvmw->pte);

if (!is_migration_entry(entry))
if (!is_migration_entry(entry) &&
!is_device_exclusive_entry(entry))
return false;

pfn = swp_offset(entry);
@@ -102,7 +104,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)

/* Handle un-addressable ZONE_DEVICE memory */
entry = pte_to_swp_entry(*pvmw->pte);
if (!is_device_private_entry(entry))
if (!is_device_private_entry(entry) &&
!is_device_exclusive_entry(entry))
return false;

pfn = swp_offset(entry);
