From a5eeb8d197a8e10c333422e9cc0f2c7d976a3426 Mon Sep 17 00:00:00 2001
From: Alexey Avramov
Date: Tue, 6 Apr 2021 06:59:44 +0900
Subject: [PATCH] mm/swap: fix system stuck due to infinite loop
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

> In the case of high system memory and load pressure, we ran ltp test
> and found that the system was stuck, the direct memory reclaim was
> all stuck in io_schedule

> For the first time involving the swap part, there is no good way to fix
> the problem

The solution is protecting the clean file pages.

Look at this:

> On ChromiumOS, we do not use swap. When memory is low, the only
> way to free memory is to reclaim pages from the file list. This
> results in a lot of thrashing under low memory conditions. We see
> the system become unresponsive for minutes before it eventually OOMs.
> We also see very slow browser tab switching under low memory. Instead
> of an unresponsive system, we'd really like the kernel to OOM as soon
> as it starts to thrash. If it can't keep the working set in memory,
> then OOM. Losing one of many tabs is a better behaviour for the user
> than an unresponsive system.
>
> This patch create a new sysctl, min_filelist_kbytes, which disables
> reclaim of file-backed pages when when there are less than min_filelist_bytes
> worth of such pages in the cache. This tunable is handy for low memory
> systems using solid-state storage where interactive response is more important
> than not OOMing.
>
> With this patch and min_filelist_kbytes set to 50000, I see very little block
> layer activity during low memory. The system stays responsive under low
> memory and browser tab switching is fast. Eventually, a process a gets killed
> by OOM. Without this patch, the system gets wedged for minutes before it
> eventually OOMs.

Source: https://lore.kernel.org/patchwork/patch/222042/

This patch can almost completely eliminate thrashing under memory pressure.

Effects
- Improving system responsiveness under low-memory conditions;
- Improving performance in I/O bound tasks under memory pressure;
- OOM killer comes faster (with hard protection);
- Fast system reclaiming after OOM.

Read more: https://github.com/hakavlad/le9-patch

The patch:

From 371e3e5290652e97d5279d8cd215cd356c1fb47b Mon Sep 17 00:00:00 2001
From: Alexey Avramov
Date: Mon, 5 Apr 2021 01:53:26 +0900
Subject: [PATCH] mm/vmscan: add sysctl knobs for protecting the specified amount of clean file cache

The kernel does not have a mechanism for targeted protection of clean file
pages (CFP). A certain amount of the CFP is required by the userspace for
normal operation. First of all, you need a cache of shared libraries and
executable files. If the volume of the CFP cache falls below a certain level,
thrashing and even livelock occur.

Protection of CFP may be used to prevent thrashing and reduce I/O under
memory pressure. Hard protection of CFP may be used to avoid high latency and
prevent livelock in near-OOM conditions. The patch provides sysctl knobs for
protecting the specified amount of clean file cache under memory pressure.

The vm.clean_low_kbytes sysctl knob provides *best-effort* protection of CFP.
The CFP on the current node won't be reclaimed under memory pressure when their
volume is below vm.clean_low_kbytes *unless* we threaten to OOM or have no
swap space or vm.swappiness=0. Setting it to a high value may result in an
early eviction of anonymous pages into the swap space by attempting to hold
the protected amount of clean file pages in memory. The default value is
defined by CONFIG_CLEAN_LOW_KBYTES (suggested 0 in Kconfig).

The vm.clean_min_kbytes sysctl knob provides *hard* protection of CFP. The CFP
on the current node won't be reclaimed under memory pressure when their volume
is below vm.clean_min_kbytes.
Setting it to a high value may result in an early out-of-memory condition due
to the inability to reclaim the protected amount of CFP when other types of
pages cannot be reclaimed. The default value is defined by
CONFIG_CLEAN_MIN_KBYTES (suggested 0 in Kconfig).

Reported-by: Artem S. Tashkinov
Signed-off-by: Alexey Avramov
---
 Documentation/admin-guide/sysctl/vm.rst | 37 ++++++++++++++++
 include/linux/mm.h                      |  3 ++
 kernel/sysctl.c                         | 14 ++++++
 mm/Kconfig                              | 35 +++++++++++++++
 mm/vmscan.c                             | 59 +++++++++++++++++++++++++
 5 files changed, 148 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 586cd4b8642842..01187bf01b8e96 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -26,6 +26,8 @@ Currently, these files are in /proc/sys/vm:
 
 - admin_reserve_kbytes
 - block_dump
+- clean_low_kbytes
+- clean_min_kbytes
 - compact_memory
 - compaction_proactiveness
 - compact_unevictable_allowed
@@ -113,6 +115,41 @@ block_dump enables block I/O debugging when set to a nonzero value. More
 information on block I/O debugging is in Documentation/admin-guide/laptops/laptop-mode.rst.
 
 
+clean_low_kbytes
+=====================
+
+This knob provides *best-effort* protection of clean file pages. The clean file
+pages on the current node won't be reclaimed under memory pressure when their
+volume is below vm.clean_low_kbytes *unless* we threaten to OOM or have no
+swap space or vm.swappiness=0.
+
+Protection of clean file pages may be used to prevent thrashing and
+reduce I/O under low-memory conditions.
+
+Setting it to a high value may result in an early eviction of anonymous pages
+into the swap space by attempting to hold the protected amount of clean file
+pages in memory.
+
+The default value is defined by CONFIG_CLEAN_LOW_KBYTES.
+
+
+clean_min_kbytes
+=====================
+
+This knob provides *hard* protection of clean file pages.
+The clean file pages on the current node won't be reclaimed under memory
+pressure when their volume is below vm.clean_min_kbytes.
+
+Hard protection of clean file pages may be used to avoid high latency and
+prevent livelock in near-OOM conditions.
+
+Setting it to a high value may result in an early out-of-memory condition due to
+the inability to reclaim the protected amount of clean file pages when other
+types of pages cannot be reclaimed.
+
+The default value is defined by CONFIG_CLEAN_MIN_KBYTES.
+
+
 compact_memory
 ==============
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8ba434287387b7..383feb1cfd6761 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -203,6 +203,9 @@ static inline void __mm_zero_struct_page(struct page *page)
 
 extern int sysctl_max_map_count;
 
+extern unsigned long sysctl_clean_low_kbytes;
+extern unsigned long sysctl_clean_min_kbytes;
+
 extern unsigned long sysctl_user_reserve_kbytes;
 extern unsigned long sysctl_admin_reserve_kbytes;
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 62fbd09b5dc1c0..215317d018018c 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -3093,6 +3093,20 @@ static struct ctl_table vm_table[] = {
 		.extra2		= SYSCTL_ONE,
 	},
 #endif
+	{
+		.procname	= "clean_low_kbytes",
+		.data		= &sysctl_clean_low_kbytes,
+		.maxlen		= sizeof(sysctl_clean_low_kbytes),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+	},
+	{
+		.procname	= "clean_min_kbytes",
+		.data		= &sysctl_clean_min_kbytes,
+		.maxlen		= sizeof(sysctl_clean_min_kbytes),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+	},
 	{
 		.procname	= "user_reserve_kbytes",
 		.data		= &sysctl_user_reserve_kbytes,
diff --git a/mm/Kconfig b/mm/Kconfig
index 24c045b24b9506..36349833afc630 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -122,6 +122,41 @@ config SPARSEMEM_VMEMMAP
 	  pfn_to_page and page_to_pfn operations.  This is the most
 	  efficient option when sufficient kernel resources are available.
 
+config CLEAN_LOW_KBYTES
+	int "Default value for vm.clean_low_kbytes"
+	depends on SYSCTL
+	default "0"
+	help
+	  The vm.clean_low_kbytes sysctl knob provides *best-effort*
+	  protection of clean file pages. The clean file pages on the current
+	  node won't be reclaimed under memory pressure when their volume is
+	  below vm.clean_low_kbytes *unless* we threaten to OOM or have
+	  no swap space or vm.swappiness=0.
+
+	  Protection of clean file pages may be used to prevent thrashing and
+	  reduce I/O under low-memory conditions.
+
+	  Setting it to a high value may result in an early eviction of anonymous
+	  pages into the swap space by attempting to hold the protected amount of
+	  clean file pages in memory.
+
+config CLEAN_MIN_KBYTES
+	int "Default value for vm.clean_min_kbytes"
+	depends on SYSCTL
+	default "0"
+	help
+	  The vm.clean_min_kbytes sysctl knob provides *hard* protection
+	  of clean file pages. The clean file pages on the current node won't be
+	  reclaimed under memory pressure when their volume is below
+	  vm.clean_min_kbytes.
+
+	  Hard protection of clean file pages may be used to avoid high latency and
+	  prevent livelock in near-OOM conditions.
+
+	  Setting it to a high value may result in an early out-of-memory condition
+	  due to the inability to reclaim the protected amount of clean file pages
+	  when other types of pages cannot be reclaimed.
+
 config HAVE_MEMBLOCK_PHYS_MAP
 	bool
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 562e87cbd7a1ab..470381e326c15c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -118,6 +118,19 @@ struct scan_control {
 	/* The file pages on the current node are dangerously low */
 	unsigned int file_is_tiny:1;
 
+	/*
+	 * The clean file pages on the current node won't be reclaimed when
+	 * their volume is below vm.clean_low_kbytes *unless* we threaten
+	 * to OOM or have no swap space or vm.swappiness=0.
+	 */
+	unsigned int clean_below_low:1;
+
+	/*
+	 * The clean file pages on the current node won't be reclaimed when
+	 * their volume is below vm.clean_min_kbytes.
+	 */
+	unsigned int clean_below_min:1;
+
 	/* Allocation order */
 	s8 order;
 
@@ -164,6 +177,17 @@ struct scan_control {
 #define prefetchw_prev_lru_page(_page, _base, _field) do { } while (0)
 #endif
 
+#if CONFIG_CLEAN_LOW_KBYTES < 0
+#error "CONFIG_CLEAN_LOW_KBYTES must be >= 0"
+#endif
+
+#if CONFIG_CLEAN_MIN_KBYTES < 0
+#error "CONFIG_CLEAN_MIN_KBYTES must be >= 0"
+#endif
+
+unsigned long sysctl_clean_low_kbytes __read_mostly = CONFIG_CLEAN_LOW_KBYTES;
+unsigned long sysctl_clean_min_kbytes __read_mostly = CONFIG_CLEAN_MIN_KBYTES;
+
 /*
  * From 0 .. 200.  Higher means more swappy.
  */
@@ -2281,6 +2305,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 		goto out;
 	}
 
+	/*
+	 * Force-scan anon if clean file pages are under vm.clean_min_kbytes
+	 * or vm.clean_low_kbytes (unless the swappiness setting
+	 * disagrees with swapping).
+	 */
+	if ((sc->clean_below_low || sc->clean_below_min) && swappiness) {
+		scan_balance = SCAN_ANON;
+		goto out;
+	}
+
 	/*
 	 * If there is enough inactive page cache, we do not reclaim
 	 * anything from the anonymous working right now.
@@ -2417,6 +2451,13 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 			BUG();
 		}
 
+		/*
+		 * Don't reclaim clean file pages when their volume is below
+		 * vm.clean_min_kbytes.
+		 */
+		if (file && sc->clean_below_min)
+			scan = 0;
+
 		nr[lru] = scan;
 	}
 }
@@ -2767,6 +2808,24 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			anon >> sc->priority;
 	}
 
+	if (sysctl_clean_low_kbytes || sysctl_clean_min_kbytes) {
+		unsigned long reclaimable_file, dirty, clean = 0;
+
+		reclaimable_file =
+			node_page_state(pgdat, NR_ACTIVE_FILE) +
+			node_page_state(pgdat, NR_INACTIVE_FILE) +
+			node_page_state(pgdat, NR_ISOLATED_FILE);
+		dirty = node_page_state(pgdat, NR_FILE_DIRTY);
+		if (reclaimable_file > dirty)
+			clean = (reclaimable_file - dirty) << (PAGE_SHIFT - 10);
+
+		sc->clean_below_low = clean < sysctl_clean_low_kbytes;
+		sc->clean_below_min = clean < sysctl_clean_min_kbytes;
+	} else {
+		sc->clean_below_low = false;
+		sc->clean_below_min = false;
+	}
+
 	shrink_node_memcgs(pgdat, sc);
 
 	if (reclaim_state) {
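
Note (illustration only, not part of the patch): the threshold check in the shrink_node() hunk can be tried out in userspace with a minimal sketch like the one below. The names clean_flags, pages_to_kbytes(), classify() and SKETCH_PAGE_SHIFT are hypothetical and do not exist in the kernel. The sketch assumes 4 KiB pages and treats the case where dirty pages meet or exceed the file total as zero clean pages.

```c
#include <stdbool.h>

/* Assumption for this sketch: 4 KiB pages, so the page shift is 12. */
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical stand-in for the two scan_control bits added by the patch. */
struct clean_flags {
	bool below_low;
	bool below_min;
};

/* Convert a page count to KiB with the same shift the hunk uses. */
static unsigned long pages_to_kbytes(unsigned long pages)
{
	return pages << (SKETCH_PAGE_SHIFT - 10);
}

/*
 * Compare the clean file volume against the two thresholds.
 * reclaimable_file and dirty are page counts; the thresholds are in KiB,
 * like vm.clean_low_kbytes and vm.clean_min_kbytes.
 */
static struct clean_flags classify(unsigned long reclaimable_file,
				   unsigned long dirty,
				   unsigned long low_kbytes,
				   unsigned long min_kbytes)
{
	struct clean_flags f = { false, false };
	unsigned long clean = 0;

	/* Both knobs set to 0 means protection is disabled. */
	if (!low_kbytes && !min_kbytes)
		return f;

	if (reclaimable_file > dirty)
		clean = pages_to_kbytes(reclaimable_file - dirty);

	f.below_low = clean < low_kbytes;
	f.below_min = clean < min_kbytes;
	return f;
}
```

For example, 30000 file pages with 5000 dirty leave 25000 clean pages, or 100000 KiB. With a low threshold of 150000 and a min threshold of 50000, classify() sets below_low but not below_min, so under the patch's logic only the best-effort protection would apply.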