sched/umcg: implement UMCG syscalls
Define struct umcg_task and two syscalls: sys_umcg_ctl and sys_umcg_wait.
User Managed Concurrency Groups is an M:N threading toolkit that allows
constructing user space schedulers designed to efficiently manage
heterogeneous in-process workloads while maintaining high CPU
utilization (95%+).
In addition, M:N threading and cooperative user space scheduling
enable a synchronous coding style and better cache locality compared
to the asynchronous callback/continuation style of programming.
The UMCG kernel API is built around the following ideas:
* UMCG server: a task/thread representing "kernel threads", or (v)CPUs;
* UMCG worker: a task/thread representing "application threads", to be
scheduled over servers;
* UMCG task state: (NONE), RUNNING, BLOCKED, IDLE: states a UMCG task (a
server or a worker) can be in;
* UMCG task state flag: LOCKED, PREEMPTED: additional state flags that
can be ORed with the task state to communicate additional information to
the kernel;
* struct umcg_task: a per-task userspace set of data fields, usually
residing in the TLS, that fully reflects the current task's UMCG state
and controls the way the kernel manages the task;
* sys_umcg_ctl(): a syscall used to register the current task/thread as a
server or a worker, or to unregister a UMCG task;
* sys_umcg_wait(): a syscall used to put the current task to sleep and/or
wake another task, potentially context-switching between the two tasks
on-CPU synchronously.
In short, servers can be thought of as CPUs over which application
threads (workers) are scheduled; at any one time a worker is either:
- RUNNING: has a server and is schedulable by the kernel;
- BLOCKED: blocked in the kernel (e.g. on I/O, or a futex);
- IDLE: is not blocked, but cannot be scheduled by the kernel to
run because it has no server assigned to it (e.g. because all
available servers are busy "running" other workers).
Usually the number of servers in a process is equal to the number of
CPUs available to the kernel if the process is supposed to consume
the whole machine, or less than the number of CPUs available if the
process is sharing the machine with other workloads. The number of
workers in a process can grow very large: tens of thousands is normal;
hundreds of thousands and more (millions) is something that would
be desirable to achieve in the future, as lightweight userspace
threads in Java and Go easily scale to millions, and UMCG workers
are (intended to be) conceptually similar to those.
Detailed use cases and API behavior are provided in
Documentation/userspace-api/umcg.[txt|rst] (see sibling patches).
Some high-level implementation notes:
UMCG tasks (workers and servers) are "tagged" with struct umcg_task
residing in userspace (usually in TLS) to facilitate kernel/userspace
communication. This makes the kernel-side code much simpler (see e.g.
the implementation of sys_umcg_wait), but also requires some careful
uaccess handling and page pinning (see below).
The main UMCG server/worker interaction looks like:
a. worker W1 is RUNNING, with a server S attached to it sleeping
in IDLE state;
b. worker W1 blocks in the kernel, e.g. on I/O;
c. the kernel marks W1 as BLOCKED, the attached server S
as RUNNING, and wakes S (the "block detection" event);
d. the server now picks another IDLE worker W2 to run: marks
W2 as RUNNING, itself as IDLE, and calls sys_umcg_wait();
e. when the blocking operation of W1 completes, the worker
is marked by the kernel as IDLE and added to idle workers list
(see struct umcg_task) for the userspace to pick up and
later run (the "wake detection" event).
While there are additional operations such as worker-to-worker
context switch, preemption, workers "yielding", etc., the "workflow"
above is the main worker/server interaction that drives the
implementation.
Specifically:
- most operations are conceptually context switches:
- scheduling a worker: a running server goes to sleep and "runs"
a worker in its place;
- block detection: worker is descheduled, and its server is woken;
- wake detection: woken worker, running in the kernel, is descheduled,
and if there is an idle server, it is woken to process the wake
detection event;
- to facilitate low scheduling latencies and cache locality, most
server/worker interactions described above are performed synchronously
"on CPU" via WF_CURRENT_CPU flag passed to ttwu; while at the moment
the context switches are simulated by putting the switch-out task to
sleep and waking the switch-into task on the same cpu, it is very much
the long-term goal of this project to make the context switch much
lighter, by tweaking runtime accounting and, maybe, even bypassing
__schedule();
- worker blocking is detected in a hook to sched_submit_work; as mentioned
above, the server is to be woken on the same CPU, synchronously;
this code may not pagefault, so to access worker's and server's
userspace memory (struct umcg_task), memory pages containing the worker's
and the server's structs umcg_task are pinned when the worker is
exiting to the userspace, and unpinned when the worker is descheduled;
- worker wakeup is detected in a hook to sched_update_worker, and processed
in the exit to usermode loop (via TIF_NOTIFY_RESUME); workers CAN
pagefault on the wakeup path;
- worker preemption is implemented by the userspace tagging the worker
with UMCG_TF_PREEMPTED state flag and sending a NOOP signal to it;
on the exit to usermode the worker is intercepted and its server is woken
(see Documentation/userspace-api/umcg.[txt|rst] for more details);
- each state change is tagged with a unique timestamp (of MONOTONIC
variety), so that
- scheduling instrumentation is naturally available;
- racing state changes are easily detected and ABA issues are
avoided;
see umcg_update_state() in umcg.c for implementation details, and
Documentation/userspace-api/umcg.[txt|rst] for a higher-level
description.
The previous version of the patchset can be found at
https://lore.kernel.org/all/20210917180323.278250-1-posk@google.com/
containing some additional context and links to earlier discussions.
More details are available in Documentation/userspace-api/umcg.[txt|rst]
in sibling patches, and in doc-comments in the code.
Signed-off-by: Peter Oskolkov <posk@google.com>