Memory management
This document describes BOINC's mechanisms related to RAM and swap space.
The design is based on the following (simplified) model of virtual memory:
- Each process P has a 'virtual address space'. This can increase over time, e.g. as the program calls malloc().
- At any point, some pages of P's virtual address space are mapped to physical memory (RAM).
- If P references a page that's not currently mapped (a 'page fault'), the OS allocates a page of RAM and maps it.
- If no pages are free, the OS finds the least-recently-used (LRU) page (possibly belonging to another process P2). It writes that page to an area of disk called 'swap space', unmaps it from P2, and maps it in P.
The set of pages mapped in a process P is called its 'resident set', or 'working set' (I use these interchangeably). This is a function of the memory accesses both of P and of the other processes. The 'working set size' (WSS) can go up and down over time. For example, if the program scans a large array, the WSS might go way up, then gradually go back down as the RAM gets claimed by other processes.
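The model above can be made concrete with a toy simulation. The following Python sketch is illustrative only: `simulate_lru` and its arguments are hypothetical names, and it assumes strict global LRU replacement, which real operating systems only approximate. It counts page faults and reports how many pages of each process are resident at the end.

```python
from collections import OrderedDict


def simulate_lru(num_frames, references):
    """Simulate global LRU page replacement (toy model, not OS code).

    references: iterable of (process, page) pairs in access order.
    Returns (page_faults, resident), where resident maps each process
    to the number of its pages currently mapped in RAM.
    """
    frames = OrderedDict()  # (process, page) -> None, least recent first
    faults = 0
    for ref in references:
        if ref in frames:
            frames.move_to_end(ref)  # mark as most recently used
            continue
        faults += 1  # page fault: the page is not mapped
        if len(frames) >= num_frames:
            frames.popitem(last=False)  # evict the LRU page (to swap)
        frames[ref] = None
    resident = {}
    for proc, _page in frames:
        resident[proc] = resident.get(proc, 0) + 1
    return faults, resident


# P scans 6 pages, then P2 touches 3 pages, with 6 frames of RAM.
refs = [("P", i) for i in range(6)] + [("P2", i) for i in range(3)]
faults, resident = simulate_lru(6, refs)
print(faults, resident)
```

In this run P's resident set first grows to fill RAM, then shrinks as P2's references claim the least-recently-used frames, which is the rise-and-fall behavior described above.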
NOTE: Wikipedia has a different definition of 'working set'.
Their definition has a parameter dt;
they define WSS(dt) as the set of pages referenced in the last dt 'time units'.
But what value should dt have?
And what does it even mean?
Does dt include time spent waiting for disk in page faults?
In any case, I don't think any OSs measure or use this notion of WSS.
If processes reference lots of memory, the system can enter a state where all RAM is used, and a large fraction of memory references result in page faults. Since disk I/O is far slower than RAM access, this causes all processes to run slowly; this is called 'thrashing'. Some OSs deal with it by swapping some processes entirely to disk and suspending them.
BOINC applications run at the lowest CPU priority. However, they can impact user-visible performance because of their memory usage:
- When the system is in use (i.e. when there's mouse/keyboard input), the memory usage of running BOINC apps can cause thrashing.
- If several user apps are open and the system is idle for a long period, the memory usage of BOINC apps may cause the user apps to be paged out. When the user eventually returns, it may take a while (10-20 seconds) for the user apps to get paged back in.
These effects can be minimized by limiting the memory usage of BOINC apps. However, this can reduce the CPU time available to BOINC, and on some systems BOINC would do no work at all. In general, the more computing BOINC does, the greater its potential impact on user-visible performance. Our design provides user preferences that adjust this tradeoff (see below).
We want to maximize the CPU efficiency of BOINC apps, i.e. to ensure that they don't thrash. On a multiprocessor, it may sometimes be better (in terms of throughput) to not use all available CPUs.
Some applications can trade off memory usage for speed (e.g. by using bigger hash tables), but beyond some point increasing memory usage causes thrashing and the advantage is negated. Such applications should be made aware of the current memory situation, so that they can adapt their usage accordingly.
BOINC has preferences:
ram_max_used_busy_frac
ram_max_used_idle_frac
These limit the total WSS of running apps (expressed as a fraction of total RAM) when the computer is in use and idle, respectively.
vm_max_used_frac
This limits the total virtual sizes of app processes (both running and suspended). It's expressed as a fraction of the swap space size. This is at most 1, preventing BOINC jobs from exceeding swap space size (and getting killed or causing other jobs to be killed). Lower values (the default is 0.75) ensure that swap space is available to non-BOINC apps.
Notes:
- This assumes that the entire virtual space may be swapped. This is not the case: for example, the part of the space containing the program is backed by the program file.
- On Windows and Linux, swap space is a dedicated part of disk, and has a fixed size. On Mac it's dynamically allocated and can potentially use all of the disk not being used for files; BOINC currently assumes this. There are conflicting statements online saying that there's an additional limit in the range of 50 or 100 GB.
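As a rough illustration of how these preferences translate into byte limits, here is a minimal Python sketch. The function name `memory_limits`, the dictionary layout, and the RAM fractions used in the example (0.5 and 0.9) are assumptions for illustration; only the 0.75 default for vm_max_used_frac comes from the text above.

```python
GB = 2**30


def memory_limits(ram_bytes, swap_bytes, prefs, user_active):
    """Turn preference fractions into byte limits (hypothetical helper)."""
    if user_active:
        ram_frac = prefs["ram_max_used_busy_frac"]
    else:
        ram_frac = prefs["ram_max_used_idle_frac"]
    # vm_max_used_frac is at most 1, so clamp it defensively.
    vm_frac = min(prefs["vm_max_used_frac"], 1.0)
    return {
        "ram_limit": ram_bytes * ram_frac,    # cap on total WSS of running apps
        "swap_limit": swap_bytes * vm_frac,   # cap on total virtual size
    }


prefs = {
    "ram_max_used_busy_frac": 0.5,  # example value, not from the text
    "ram_max_used_idle_frac": 0.9,  # example value, not from the text
    "vm_max_used_frac": 0.75,       # default stated above
}
print(memory_limits(16 * GB, 8 * GB, prefs, user_active=True))
```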
On startup, the client measures:
- The amount of RAM
- The amount of swap space (see above).
It measures the following periodically (every 10 seconds):
- For each running BOINC app: the working set size (for compound apps, this includes all processes).
To accommodate spikes in memory usage, BOINC also maintains a 'smoothed working set size' SWSS, computed as
SWSS = .5*SWSS + .5*WSS
where WSS is the new value.
- For each BOINC app: the virtual space size.
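The smoothing formula can be tried out with a few samples. This is a minimal Python sketch; `update_swss` is a hypothetical name, and seeding SWSS with the first measurement is an assumption, since the initial value is not specified here.

```python
def update_swss(swss, wss):
    """One smoothing step: SWSS = .5*SWSS + .5*WSS."""
    return 0.5 * swss + 0.5 * wss


# WSS samples in MB, one per measurement period, with a single spike.
samples = [100.0, 900.0, 100.0, 100.0]
swss = samples[0]  # assumption: start from the first measurement
history = []
for wss in samples:
    swss = update_swss(swss, wss)
    history.append(swss)
print(history)
```

With these samples the spike of 900 shows up as only 500 in the smoothed value, and the smoothed value then decays by half the remaining gap on each subsequent measurement.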
The RAM usage limit is enforced by suspending jobs: when a job is suspended, it stops referencing its resident pages, and they eventually get assigned to other processes.
The RAM usage limit (WSS) may be different between idle and in-use states.
avail_ram is the limit in the current state;
max_ram is the max of the limits.
Every 1 sec, ACTIVE_TASK_SET::check_rsc_limits_exceeded()
scans running jobs.
For a job J let WSS(J) denote its smoothed WSS.
If WSS(J) > max_ram, J is aborted.
If WSS(J) > avail_ram we trigger a reschedule,
which will preempt the job.
If the total WSS exceeds avail_ram we trigger a reschedule,
which will preempt at least one job.
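The checks above can be sketched in Python as follows. This is not the client's code: `Job`, `check_ram_limits`, and the return shape are hypothetical, and excluding aborted jobs from the total is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    swss: float  # smoothed working set size, in bytes


def check_ram_limits(jobs, avail_ram, max_ram):
    """Return (jobs_to_abort, need_reschedule) for the running jobs."""
    # A job that exceeds the largest limit can never run: abort it.
    to_abort = [j for j in jobs if j.swss > max_ram]
    # Assumption: aborted jobs no longer count toward the total.
    survivors = [j for j in jobs if j.swss <= max_ram]
    over_single = any(j.swss > avail_ram for j in survivors)
    over_total = sum(j.swss for j in survivors) > avail_ram
    return to_abort, over_single or over_total


jobs = [Job("a", 3.0), Job("b", 3.0), Job("c", 10.0)]
to_abort, reschedule = check_ram_limits(jobs, avail_ram=5.0, max_ram=8.0)
print([j.name for j in to_abort], reschedule)
```

Here job c exceeds max_ram and is aborted, while a and b together exceed avail_ram, so a reschedule is requested and the scheduler decides which one to preempt.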
In the CPU scheduler (CLIENT_STATE::schedule_cpus()),
as we go through the list of runnable jobs,
we keep track of the WSS used so far.
We skip any job that would cause avail_ram to be exceeded;
if the job is running, we preempt it.
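A minimal Python sketch of this scheduling pass is shown below. The function `select_jobs` and the tuple layout are hypothetical; the real scheduler also considers CPU counts, priorities, and deadlines, which are left out here.

```python
def select_jobs(runnable, avail_ram):
    """Pick jobs to run without exceeding avail_ram.

    runnable: list of (name, wss, is_running) in scheduling order.
    Returns (run, preempt): names to run and names of running jobs to preempt.
    """
    used = 0.0
    run, preempt = [], []
    for name, wss, is_running in runnable:
        if used + wss > avail_ram:
            # This job would push the total over the limit: skip it.
            if is_running:
                preempt.append(name)
            continue
        used += wss
        run.append(name)
    return run, preempt


# Sizes in GB; avail_ram is 5 GB.
jobs = [("A", 4.0, True), ("B", 4.0, True), ("C", 1.0, False)]
print(select_jobs(jobs, avail_ram=5.0))
```

Note that a skipped job does not end the scan: a smaller job later in the list (C in this example) can still fit in the remaining RAM.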
What should the client use as WSS for a job J that hasn't run yet?
We don't want to start J if it's going to exceed RAM limits.
We could use the workunit rsc_memory_bound,
but most projects don't set that accurately.
We could use the max WSS of all jobs that used the same app version as J. But jobs for a given app version may vary widely over time. So instead we take the max WSS of currently running jobs that use the same app version as J.
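The chosen estimate can be sketched as below. This Python snippet is illustrative: `estimate_wss` is a hypothetical name, and the `fallback` used when no job of the same app version is running is an assumption, since that case is not covered here.

```python
def estimate_wss(app_version, running_jobs, fallback=0.0):
    """Estimate WSS for a job that has not run yet.

    running_jobs: iterable of (app_version, wss) for currently running jobs.
    fallback: value used when no running job shares the app version
              (an assumption; the text does not specify this case).
    """
    same = [wss for av, wss in running_jobs if av == app_version]
    return max(same) if same else fallback


running = [("v1", 2.0), ("v1", 3.5), ("v2", 1.0)]
print(estimate_wss("v1", running))
print(estimate_wss("v3", running, fallback=0.5))
```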
The main problem is that resident set size differs from 'recently-used set size' (the Wikipedia notion). So for example, suppose
- the system has 10 GB of RAM and 2 CPUs
- ram_max_used_frac is set to 0.5
- there are 2 jobs, each of which has a resident set size of 4 GB and a recently-used set size of 1 GB.
In the absence of other processes, BOINC will run just 1 job, starving the 2nd CPU. (In practice, other processes will gradually claim the unused pages, so the jobs' resident set sizes will shrink toward 1 GB; so maybe this isn't a real problem.)
The client measures the 'virtual size' of each process. We assume that this entire size could be put in swap space (in practice it would be somewhat less).
Suspending a process doesn't reduce its swap usage; in fact it increases because its RAM pages will get paged out. So to reduce swap usage, the client needs to kill processes. These jobs will eventually be restarted, but they'll resume from their last checkpoint (if any) and computing time will be lost; we want to minimize this. We also need to ensure that, given a set of jobs, we don't kill them in a cycle; that could result in none of them ever finishing.
When we kill a job in this way, we say that it is 'swap-killed'. We added a field
ACTIVE_TASK::swap_killed
We added the following logic at the start of schedule_cpus():
X = sum of virtual sizes of running tasks
Z = usable swap space
if X > Z
    swap_kill one or more tasks to bring total within limit Z
        order: increasing CPU time since last checkpoint
        (minimize wasted computing)
    set swap_killed of those tasks
    run_list = remaining executing tasks
    call enforce_run_list() to run those tasks, subject to WSS limits
else (running tasks fit in swap)
    A = list of swap-killed tasks
    if A is empty
        use normal scheduling logic
    else
        run_list = running tasks
        scan swap_killed tasks T (in increasing deadline order)
            if T would exceed swap limit
                break;
            T.swap_killed = false
            add T to run list
        call enforce_run_list() to run those tasks, subject to WSS limits
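The logic above can be exercised with a small runnable model. This Python sketch is an approximation under stated assumptions: `Task` and `plan_swap` are hypothetical names, the normal scheduling path and enforce_run_list() are omitted, and the function simply returns the run list it would hand over.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    virtual_size: float          # bytes of virtual address space
    cpu_since_checkpoint: float  # CPU seconds that would be lost if killed
    deadline: float
    swap_killed: bool = False


def plan_swap(tasks, swap_limit):
    """Return the run list; updates swap_killed flags in place."""
    running = [t for t in tasks if not t.swap_killed]
    total = sum(t.virtual_size for t in running)
    if total > swap_limit:
        # Kill the tasks that lose the least computing first.
        for t in sorted(running, key=lambda t: t.cpu_since_checkpoint):
            if total <= swap_limit:
                break
            t.swap_killed = True
            total -= t.virtual_size
        return [t for t in running if not t.swap_killed]
    # Running tasks fit in swap: revive swap-killed tasks if there is room.
    run_list = list(running)
    killed = [t for t in tasks if t.swap_killed]
    for t in sorted(killed, key=lambda t: t.deadline):
        if total + t.virtual_size > swap_limit:
            break
        t.swap_killed = False
        total += t.virtual_size
        run_list.append(t)
    return run_list


tasks = [Task("a", 4.0, 100.0, 3.0), Task("b", 4.0, 10.0, 1.0),
         Task("c", 4.0, 50.0, 2.0)]
print([t.name for t in plan_swap(tasks, swap_limit=9.0)])
print([t.name for t in plan_swap(tasks, swap_limit=12.0)])
```

Because swap-killed tasks are revived only when they fit alongside the tasks already running, a revived task does not immediately force another kill, which is how the sketch avoids the kill cycle mentioned earlier.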
The BOINC_STATUS structure contains:
double working_set_size; // app's current WS (non-smoothed)
double max_working_set_size; // app will be aborted if WS exceeds this
So the app might size arrays to fit in the difference between these.
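As an illustration of how an application might use these two fields, here is a minimal Python sketch. `BoincStatus` is a stand-in for the real structure, and `table_entries`, the `safety` margin, and the `minimum` size are hypothetical choices, not part of the BOINC API.

```python
from dataclasses import dataclass


@dataclass
class BoincStatus:
    """Stand-in holding the two fields described above (bytes)."""
    working_set_size: float
    max_working_set_size: float


def table_entries(status, bytes_per_entry, safety=0.5, minimum=1024):
    """Choose a table size that fits in the remaining working-set headroom.

    safety: fraction of the headroom to use (assumed margin).
    minimum: smallest table the app can work with (assumed).
    """
    headroom = max(0.0, status.max_working_set_size - status.working_set_size)
    return max(minimum, int(headroom * safety) // bytes_per_entry)


status = BoincStatus(working_set_size=2**30, max_working_set_size=3 * 2**30)
print(table_entries(status, bytes_per_entry=16))
```

Using only part of the headroom leaves room for the rest of the app's memory to grow before the abort threshold is reached.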
Each workunit includes:
- rsc_memory_bound: an estimate of the app's largest working set size.
Note: most projects supply inaccurate (usually too small) values.
A result is sent to a client only if
rsc_memory_bound < (RAM size)*max(ram_max_used_busy_frac, ram_max_used_idle_frac)
In other words, a job is sent only if the client can run it at least some of the time.
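The send condition is a one-line comparison; the Python sketch below spells it out. The function name `can_send` and the example numbers are illustrative assumptions.

```python
def can_send(rsc_memory_bound, ram_size, busy_frac, idle_frac):
    """True if the job fits under the larger of the two RAM limits."""
    return rsc_memory_bound < ram_size * max(busy_frac, idle_frac)


GB = 2**30
# A host with 8 GB of RAM, limited to 50% when busy and 90% when idle.
print(can_send(4 * GB, 8 * GB, busy_frac=0.5, idle_frac=0.9))  # fits when idle
print(can_send(8 * GB, 8 * GB, busy_frac=0.5, idle_frac=0.9))  # never fits
```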
Possible ideas:
- Measure non-BOINC RAM usage (WSS size). The obvious policy is: if non-BOINC RAM usage is X, BOINC can use total-X. But this may not be effective; BOINC apps run continuously and other apps run sporadically, so the other apps will tend to have small (or zero) working sets.
- Measure non-BOINC swap usage, and limit BOINC apps to the remainder of swap space.
- Measure page-fault rates for each process, and suspend BOINC apps as needed to limit this. Problem: this info doesn't seem to be available on Win; the reported page fault rate includes faults that don't read from disk.
- Make the round-robin simulator aware of memory issues.