mm: introduce refcount for user PTE page table page
1. Preface
==========
To achieve high performance, applications commonly use high-performance
user-mode memory allocators such as jemalloc or tcmalloc. These allocators
use madvise(MADV_DONTNEED or MADV_FREE) to release physical memory for the
following reasons::

 First of all, we should hold the write lock of mmap_lock as rarely as
 possible, since the mmap_lock semaphore has long been a contention point
 in the memory management subsystem. mmap()/munmap() take the write lock,
 while madvise(MADV_DONTNEED or MADV_FREE) takes the read lock, so using
 madvise() instead of munmap() to release physical memory reduces
 contention on mmap_lock.

 Secondly, after using madvise() to release physical memory, there is no
 need to build the vma and allocate page tables again when the same
 virtual address is accessed again, which also saves time.
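
As an illustration of the allocator pattern described above, here is a
minimal userspace sketch, assuming Linux and a private anonymous mapping.
The function name release_and_retouch() is hypothetical. It dirties a
page, drops the physical memory with madvise(MADV_DONTNEED), and then
touches the same address again without a new mmap() call.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS and madvise() on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map an anonymous region, dirty it, then drop its physical pages with
 * MADV_DONTNEED. Returns the first byte read back afterwards, or -1 on
 * any failure. The name and shape of this helper are illustrative only.
 */
int release_and_retouch(size_t len)
{
	unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int first;

	if (buf == MAP_FAILED)
		return -1;

	buf[0] = 0xaa;	/* fault in a page and dirty it */

	/* Release physical memory; the mapping itself stays in place. */
	if (madvise(buf, len, MADV_DONTNEED) != 0) {
		munmap(buf, len);
		return -1;
	}

	/* The address is still valid; the page is zero-filled on demand. */
	first = buf[0];

	munmap(buf, len);
	return first;
}
```

On Linux, the read after MADV_DONTNEED returns 0 for private anonymous
memory, which shows that the old page contents were dropped while the
mapping survived.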
The following table shows the maximum amount of user PTE and PMD page
table memory that a single user process can allocate on a 32-bit and a
64-bit system.
+---------------------------+--------+---------+
|                           | 32-bit | 64-bit  |
+===========================+========+=========+
| user PTE page table pages | 3 MiB  | 512 GiB |
+---------------------------+--------+---------+
| user PMD page table pages | 3 KiB  | 1 GiB   |
+---------------------------+--------+---------+
(for 32-bit, take 3G user address space, 4K page size as an example;
for 64-bit, take 48-bit address width, 4K page size as an example.)
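
The figures in the table can be reproduced with a little arithmetic. The
sketch below assumes 4-byte entries with 1024 entries per table on 32-bit
(so one PMD-level entry covers 4 MiB) and 8-byte entries with 512 entries
per table on 64-bit (so one PMD-level entry covers 2 MiB). These entry
sizes are assumptions chosen to match the table, not values stated in this
commit, and table_bytes() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of page table needed at one level to cover span_bytes of address
 * space, where each entry maps 2^entry_shift bytes and occupies
 * entry_size bytes.
 */
uint64_t table_bytes(uint64_t span_bytes, unsigned int entry_shift,
		     unsigned int entry_size)
{
	return (span_bytes >> entry_shift) * entry_size;
}

/* Address space sizes taken from the note under the table. */
#define SPAN_32BIT	(3ULL << 30)	/* 3G user address space */
#define SPAN_64BIT	(1ULL << 48)	/* 48-bit address width */
```

For example, table_bytes(SPAN_64BIT, 12, 8) covers 2^36 PTEs of 8 bytes
each, which is 2^39 bytes, i.e. 512 GiB.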
After using madvise(), everything looks good. But as the table above
shows, a single process can create a large number of PTE page tables on a
64-bit system, because neither MADV_DONTNEED nor MADV_FREE releases page
table memory. Until the process exits or calls munmap(), the kernel cannot
reclaim these pages, even if the PTE page tables no longer map anything.
Therefore, we decided to introduce a reference count to manage the life
cycle of PTE page tables, so that PTE page table memory that is no longer
in use can be released dynamically.
2. The reference count of user PTE page table pages
===================================================
We introduce two members for the struct page of the user PTE page table
page::

	union {
		pgtable_t pmd_huge_pte; /* protected by page->ptl */
		pmd_t *pmd; /* PTE page only */
	};
	union {
		struct mm_struct *pt_mm; /* x86 pgds only */
		atomic_t pt_frag_refcount; /* powerpc */
		atomic_t pte_refcount; /* PTE page only */
	};

The pmd member records the pmd entry that maps the user PTE page table
page, and the pte_refcount member keeps track of how many references
there are to the user PTE page table page.
The following users hold a reference on the user PTE page table page::

 Any !pte_none() entry, such as a regular page table entry that maps a
 physical page, a swap entry, or a migration entry.

 Any visitor to the PTE page table entries, such as a page table walker.

Any ``!pte_none()`` entry or visitor can be regarded as a user of its
PTE page table page. When ``pte_refcount`` drops to 0, no one is using
the PTE page table page, and it can be released back to the system.
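
The life cycle described above can be modeled in userspace. This is only
a sketch using C11 atomics, not the kernel implementation: struct
model_pte_page and the model_*() functions are hypothetical stand-ins for
the PTE page table page and its pte_refcount.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a PTE page table page; not struct page. */
struct model_pte_page {
	atomic_int refcount;	/* models pte_refcount */
	bool freed;		/* set when the page would be released */
};

void model_init(struct model_pte_page *p, int count)
{
	atomic_init(&p->refcount, count);
	p->freed = false;
}

/* A new !pte_none() entry or a visitor takes a reference. */
void model_get(struct model_pte_page *p)
{
	atomic_fetch_add(&p->refcount, 1);
}

/*
 * Drop a reference. The user that drops the count to 0 releases the
 * page. Returns true if this call released it.
 */
bool model_put(struct model_pte_page *p)
{
	if (atomic_fetch_sub(&p->refcount, 1) == 1) {
		p->freed = true;	/* the kernel would free the page here */
		return true;
	}
	return false;
}
```

In this model, a visitor that starts with one reference, installs an entry
(model_get()), and then leaves (model_put()) does not free the page. The
page is released only when the entry is cleared and the count reaches 0.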
3. Helpers
==========
+---------------------+-------------------------------------------------+
| pte_ref_init        | Initialize the pte_refcount and pmd             |
+---------------------+-------------------------------------------------+
| pte_to_pmd          | Get the corresponding pmd                       |
+---------------------+-------------------------------------------------+
| pte_update_pmd      | Update the corresponding pmd                    |
+---------------------+-------------------------------------------------+
| pte_get             | Increment a pte_refcount                        |
+---------------------+-------------------------------------------------+
| pte_get_many        | Add a value to a pte_refcount                   |
+---------------------+-------------------------------------------------+
| pte_get_unless_zero | Increment a pte_refcount unless it is 0         |
+---------------------+-------------------------------------------------+
| pte_try_get         | Try to increment a pte_refcount                 |
+---------------------+-------------------------------------------------+
| pte_tryget_map      | Try to increment a pte_refcount before          |
|                     | pte_offset_map()                                |
+---------------------+-------------------------------------------------+
| pte_tryget_map_lock | Try to increment a pte_refcount before          |
|                     | pte_offset_map_lock()                           |
+---------------------+-------------------------------------------------+
| pte_put             | Decrement a pte_refcount                        |
+---------------------+-------------------------------------------------+
| pte_put_many        | Subtract a value from a pte_refcount            |
+---------------------+-------------------------------------------------+
| pte_put_vmf         | Decrement a pte_refcount in the page fault path |
+---------------------+-------------------------------------------------+
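
To make the one-line descriptions of the "many" and "unless zero" variants
more concrete, here is a userspace sketch of their counting semantics
using C11 atomics. The model_*() names are hypothetical and deliberately
differ from the helpers in the table; the real signatures and behavior are
defined by the kernel code, not by this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Add nr references at once (compare pte_get_many in the table). */
void model_get_many(atomic_int *ref, int nr)
{
	atomic_fetch_add(ref, nr);
}

/*
 * Take a reference only if the count is currently non-zero (compare
 * pte_get_unless_zero). A count of 0 means the page is being released,
 * so a new user must not revive it.
 */
bool model_get_unless_zero(atomic_int *ref)
{
	int old = atomic_load(ref);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Drop nr references at once (compare pte_put_many). Returns true if
 * the count reached 0.
 */
bool model_put_many(atomic_int *ref, int nr)
{
	return atomic_fetch_sub(ref, nr) == nr;
}
```

The compare-and-exchange loop is one common way to implement "increment
unless zero"; it is used here as an assumption about the intended
semantics rather than a description of the kernel helper.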
4. About this commit
====================
This commit only introduces dummy helpers; the actual logic will be
implemented in later commits.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>