mm/swap: fix system stuck due to infinite loop
> In the case of high system memory and load pressure, we ran ltp test
> and found that the system was stuck, the direct memory reclaim was
> all stuck in io_schedule

> For the first time involving the swap part, there is no good way to fix
> the problem

The solution is protecting the clean file pages.

Look at this:

> On ChromiumOS, we do not use swap. When memory is low, the only
> way to free memory is to reclaim pages from the file list. This
> results in a lot of thrashing under low memory conditions. We see
> the system become unresponsive for minutes before it eventually OOMs.
> We also see very slow browser tab switching under low memory. Instead
> of an unresponsive system, we'd really like the kernel to OOM as soon
> as it starts to thrash. If it can't keep the working set in memory,
> then OOM. Losing one of many tabs is a better behaviour for the user
> than an unresponsive system.

> This patch creates a new sysctl, min_filelist_kbytes, which disables
> reclaim of file-backed pages when there are less than min_filelist_kbytes
> worth of such pages in the cache. This tunable is handy for low memory
> systems using solid-state storage where interactive response is more important
> than not OOMing.

> With this patch and min_filelist_kbytes set to 50000, I see very little block
> layer activity during low memory. The system stays responsive under low
> memory and browser tab switching is fast. Eventually, a process gets killed
> by OOM. Without this patch, the system gets wedged for minutes before it
> eventually OOMs.

— https://lore.kernel.org/patchwork/patch/222042/

This patch can almost completely eliminate thrashing under memory pressure.

Effects
- Improved system responsiveness under low-memory conditions;
- Improved performance of I/O-bound tasks under memory pressure;
- The OOM killer is invoked sooner (with hard protection);
- Faster system recovery after OOM.

Read more: https://github.com/hakavlad/le9-patch

The patch:

From 371e3e5290652e97d5279d8cd215cd356c1fb47b Mon Sep 17 00:00:00 2001
From: Alexey Avramov <hakavlad@inbox.lv>
Date: Mon, 5 Apr 2021 01:53:26 +0900
Subject: [PATCH] mm/vmscan: add sysctl knobs for protecting the specified
 amount of clean file cache

The kernel does not have a mechanism for targeted protection of clean
file pages (CFP). A certain amount of the CFP is required by the userspace
for normal operation. First of all, you need a cache of shared libraries
and executable files. If the volume of the CFP cache falls below a certain
level, thrashing and even livelock occurs.

Protection of CFP may be used to prevent thrashing and reduce I/O under
memory pressure. Hard protection of CFP may be used to avoid high latency
and prevent livelock in near-OOM conditions. The patch provides sysctl
knobs for protecting the specified amount of clean file cache under memory
pressure.

The vm.clean_low_kbytes sysctl knob provides *best-effort* protection of
CFP. The CFP on the current node won't be reclaimed under memory pressure
when their volume is below vm.clean_low_kbytes *unless* we threaten to OOM
or have no swap space or vm.swappiness=0. Setting it to a high value may
result in an early eviction of anonymous pages into the swap space by
attempting to hold the protected amount of clean file pages in memory. The
default value is defined by CONFIG_CLEAN_LOW_KBYTES (suggested 0 in
Kconfig).

The vm.clean_min_kbytes sysctl knob provides *hard* protection of CFP. The
CFP on the current node won't be reclaimed under memory pressure when their
volume is below vm.clean_min_kbytes. Setting it to a high value may result
in an early out-of-memory condition due to the inability to reclaim the
protected amount of CFP when other types of pages cannot be reclaimed. The
default value is defined by CONFIG_CLEAN_MIN_KBYTES (suggested 0 in
Kconfig).

Reported-by: Artem S. Tashkinov <aros@gmx.com>
Signed-off-by: Alexey Avramov <hakavlad@inbox.lv>
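
The behaviour of the two knobs, as the commit message describes it, can be sketched as a small userspace model. This is only an illustration under stated assumptions: `struct reclaim_plan`, `plan_reclaim()` and the `can_swap` / `near_oom` flags are hypothetical names that do not appear in the patch, and the model follows the prose of the commit message rather than the exact control flow of mm/vmscan.c.

```c
#include <stdbool.h>

/* Hypothetical model of the reclaim decision; this is not kernel code. */
struct reclaim_plan {
	bool scan_anon;	/* anonymous pages may be swapped out */
	bool scan_file;	/* clean file pages may be dropped */
};

/*
 * clean_kb: current amount of clean file pages, in KiB
 * low_kb:   value of vm.clean_low_kbytes
 * min_kb:   value of vm.clean_min_kbytes
 * can_swap: assumed to mean "swap space exists and vm.swappiness != 0"
 * near_oom: assumed to mean "reclaim is about to fall back to the OOM killer"
 */
static struct reclaim_plan plan_reclaim(unsigned long clean_kb,
					unsigned long low_kb,
					unsigned long min_kb,
					bool can_swap, bool near_oom)
{
	struct reclaim_plan plan = { .scan_anon = can_swap, .scan_file = true };

	if (clean_kb < min_kb) {
		/* Hard protection: file pages below the floor are never reclaimed. */
		plan.scan_file = false;
	} else if (clean_kb < low_kb && can_swap && !near_oom) {
		/* Best-effort protection: holds only while swapping is an option. */
		plan.scan_file = false;
	}
	return plan;
}
```

In this model, `plan_reclaim(100, 200, 0, true, false)` protects the file cache and scans only anonymous pages, while the same call with `can_swap` set to false lifts the best-effort protection. With `clean_kb` below `min_kb` and no swap, neither list is scanned, which corresponds to the early out-of-memory condition the commit message warns about.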
Alexey Avramov authored and intel-lab-lkp committed Apr 5, 2021
1 parent 5e46d1b commit a5eeb8d
Showing 5 changed files with 148 additions and 0 deletions.
37 changes: 37 additions & 0 deletions Documentation/admin-guide/sysctl/vm.rst
@@ -26,6 +26,8 @@ Currently, these files are in /proc/sys/vm:

- admin_reserve_kbytes
- block_dump
- clean_low_kbytes
- clean_min_kbytes
- compact_memory
- compaction_proactiveness
- compact_unevictable_allowed
@@ -113,6 +115,41 @@ block_dump enables block I/O debugging when set to a nonzero value. More
information on block I/O debugging is in Documentation/admin-guide/laptops/laptop-mode.rst.


clean_low_kbytes
=====================

This knob provides *best-effort* protection of clean file pages. The clean file
pages on the current node won't be reclaimed under memory pressure when their
volume is below vm.clean_low_kbytes *unless* we threaten to OOM or have no
swap space or vm.swappiness=0.

Protection of clean file pages may be used to prevent thrashing and
reduce I/O under low-memory conditions.

Setting it to a high value may result in an early eviction of anonymous pages
into the swap space by attempting to hold the protected amount of clean file
pages in memory.

The default value is defined by CONFIG_CLEAN_LOW_KBYTES.


clean_min_kbytes
=====================

This knob provides *hard* protection of clean file pages. The clean file pages
on the current node won't be reclaimed under memory pressure when their volume
is below vm.clean_min_kbytes.

Hard protection of clean file pages may be used to avoid high latency and
prevent livelock in near-OOM conditions.

Setting it to a high value may result in an early out-of-memory condition due to
the inability to reclaim the protected amount of clean file pages when other
types of pages cannot be reclaimed.

The default value is defined by CONFIG_CLEAN_MIN_KBYTES.


compact_memory
==============

3 changes: 3 additions & 0 deletions include/linux/mm.h
@@ -203,6 +203,9 @@ static inline void __mm_zero_struct_page(struct page *page)

extern int sysctl_max_map_count;

extern unsigned long sysctl_clean_low_kbytes;
extern unsigned long sysctl_clean_min_kbytes;

extern unsigned long sysctl_user_reserve_kbytes;
extern unsigned long sysctl_admin_reserve_kbytes;

14 changes: 14 additions & 0 deletions kernel/sysctl.c
@@ -3093,6 +3093,20 @@ static struct ctl_table vm_table[] = {
.extra2 = SYSCTL_ONE,
},
#endif
{
.procname = "clean_low_kbytes",
.data = &sysctl_clean_low_kbytes,
.maxlen = sizeof(sysctl_clean_low_kbytes),
.mode = 0644,
.proc_handler = proc_doulongvec_minmax,
},
{
.procname = "clean_min_kbytes",
.data = &sysctl_clean_min_kbytes,
.maxlen = sizeof(sysctl_clean_min_kbytes),
.mode = 0644,
.proc_handler = proc_doulongvec_minmax,
},
{
.procname = "user_reserve_kbytes",
.data = &sysctl_user_reserve_kbytes,
35 changes: 35 additions & 0 deletions mm/Kconfig
@@ -122,6 +122,41 @@ config SPARSEMEM_VMEMMAP
pfn_to_page and page_to_pfn operations. This is the most
efficient option when sufficient kernel resources are available.

config CLEAN_LOW_KBYTES
int "Default value for vm.clean_low_kbytes"
depends on SYSCTL
default "0"
help
The vm.clean_low_kbytes sysctl knob provides *best-effort*
protection of clean file pages. The clean file pages on the current
node won't be reclaimed under memory pressure when their volume is
below vm.clean_low_kbytes *unless* we threaten to OOM or have
no swap space or vm.swappiness=0.

Protection of clean file pages may be used to prevent thrashing and
reduce I/O under low-memory conditions.

Setting it to a high value may result in an early eviction of anonymous
pages into the swap space by attempting to hold the protected amount of
clean file pages in memory.

config CLEAN_MIN_KBYTES
int "Default value for vm.clean_min_kbytes"
depends on SYSCTL
default "0"
help
The vm.clean_min_kbytes sysctl knob provides *hard* protection
of clean file pages. The clean file pages on the current node won't be
reclaimed under memory pressure when their volume is below
vm.clean_min_kbytes.

Hard protection of clean file pages may be used to avoid high latency and
prevent livelock in near-OOM conditions.

Setting it to a high value may result in an early out-of-memory condition
due to the inability to reclaim the protected amount of clean file pages
when other types of pages cannot be reclaimed.

config HAVE_MEMBLOCK_PHYS_MAP
bool

59 changes: 59 additions & 0 deletions mm/vmscan.c
@@ -118,6 +118,19 @@ struct scan_control {
/* The file pages on the current node are dangerously low */
unsigned int file_is_tiny:1;

/*
* The clean file pages on the current node won't be reclaimed when
* their volume is below vm.clean_low_kbytes *unless* we threaten
* to OOM or have no swap space or vm.swappiness=0.
*/
unsigned int clean_below_low:1;

/*
* The clean file pages on the current node won't be reclaimed when
* their volume is below vm.clean_min_kbytes.
*/
unsigned int clean_below_min:1;

/* Allocation order */
s8 order;

@@ -164,6 +177,17 @@ struct scan_control {
#define prefetchw_prev_lru_page(_page, _base, _field) do { } while (0)
#endif

#if CONFIG_CLEAN_LOW_KBYTES < 0
#error "CONFIG_CLEAN_LOW_KBYTES must be >= 0"
#endif

#if CONFIG_CLEAN_MIN_KBYTES < 0
#error "CONFIG_CLEAN_MIN_KBYTES must be >= 0"
#endif

unsigned long sysctl_clean_low_kbytes __read_mostly = CONFIG_CLEAN_LOW_KBYTES;
unsigned long sysctl_clean_min_kbytes __read_mostly = CONFIG_CLEAN_MIN_KBYTES;

/*
* From 0 .. 200. Higher means more swappy.
*/
@@ -2281,6 +2305,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
goto out;
}

/*
* Force-scan anon if clean file pages are under vm.clean_min_kbytes
* or vm.clean_low_kbytes (unless the swappiness setting
* disagrees with swapping).
*/
if ((sc->clean_below_low || sc->clean_below_min) && swappiness) {
scan_balance = SCAN_ANON;
goto out;
}

/*
* If there is enough inactive page cache, we do not reclaim
* anything from the anonymous working right now.
@@ -2417,6 +2451,13 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
BUG();
}

/*
* Don't reclaim clean file pages when their volume is below
* vm.clean_min_kbytes.
*/
if (file && sc->clean_below_min)
scan = 0;

nr[lru] = scan;
}
}
@@ -2767,6 +2808,24 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
anon >> sc->priority;
}

if (sysctl_clean_low_kbytes || sysctl_clean_min_kbytes) {
unsigned long reclaimable_file, dirty, clean = 0;

reclaimable_file =
node_page_state(pgdat, NR_ACTIVE_FILE) +
node_page_state(pgdat, NR_INACTIVE_FILE) +
node_page_state(pgdat, NR_ISOLATED_FILE);
dirty = node_page_state(pgdat, NR_FILE_DIRTY);
if (reclaimable_file > dirty)
clean = (reclaimable_file - dirty) << (PAGE_SHIFT - 10);

sc->clean_below_low = clean < sysctl_clean_low_kbytes;
sc->clean_below_min = clean < sysctl_clean_min_kbytes;
} else {
sc->clean_below_low = false;
sc->clean_below_min = false;
}

shrink_node_memcgs(pgdat, sc);

if (reclaim_state) {
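
The clean-page estimate in the shrink_node() hunk above can be reproduced in userspace as a rough sketch. The function name `clean_file_kbytes()` and the constant `EXAMPLE_PAGE_SHIFT` are hypothetical; a 4 KiB page size is assumed here, whereas the kernel uses the real PAGE_SHIFT of the running system. The inputs stand in for the NR_ACTIVE_FILE, NR_INACTIVE_FILE, NR_ISOLATED_FILE and NR_FILE_DIRTY node counters, all in pages.

```c
/* Assumption for this sketch: 4 KiB pages, i.e. a page shift of 12. */
#define EXAMPLE_PAGE_SHIFT 12

/*
 * Estimate the amount of clean file pages in KiB from per-node counters
 * given in pages. Returns 0 when the dirty count is at least as large as
 * the reclaimable file count, so the subtraction can never wrap around.
 */
static unsigned long clean_file_kbytes(unsigned long active_file,
				       unsigned long inactive_file,
				       unsigned long isolated_file,
				       unsigned long dirty)
{
	unsigned long reclaimable_file =
		active_file + inactive_file + isolated_file;

	if (reclaimable_file <= dirty)
		return 0;

	/* Shifting by (page shift - 10) converts pages to KiB. */
	return (reclaimable_file - dirty) << (EXAMPLE_PAGE_SHIFT - 10);
}
```

For example, 1000 active, 2000 inactive and 0 isolated file pages with 500 dirty pages leave 2500 clean pages, which is 10000 KiB under the assumed page size. That value is what would be compared against vm.clean_low_kbytes and vm.clean_min_kbytes.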