Commit
Define struct umcg_task and two syscalls: sys_umcg_ctl and sys_umcg_wait.

User Managed Concurrency Groups is an M:N threading toolkit that allows
constructing user space schedulers designed to efficiently manage
heterogeneous in-process workloads while maintaining high CPU
utilization (95%+).

In addition, M:N threading and cooperative user space scheduling
enable a synchronous coding style and better cache locality when
compared to the asynchronous callback/continuation style of
programming.

The UMCG kernel API is built around the following ideas:

* UMCG server: a task/thread representing "kernel threads", or (v)CPUs;
* UMCG worker: a task/thread representing "application threads", to be
  scheduled over servers;
* UMCG task state: (NONE), RUNNING, BLOCKED, IDLE: states a UMCG task (a
  server or a worker) can be in;
* UMCG task state flag: LOCKED, PREEMPTED: additional state flags that
  can be ORed with the task state to communicate additional information
  to the kernel;
* struct umcg_task: a per-task userspace set of data fields, usually
  residing in the TLS, that fully reflects the current task's UMCG
  state and controls the way the kernel manages the task;
* sys_umcg_ctl(): a syscall used to register the current task/thread as
  a server or a worker, or to unregister a UMCG task;
* sys_umcg_wait(): a syscall used to put the current task to sleep
  and/or wake another task, potentially context-switching between the
  two tasks on-CPU synchronously.

In short, servers can be thought of as CPUs over which application
threads (workers) are scheduled; at any one time a worker is either:

- RUNNING: has a server and is schedulable by the kernel;
- BLOCKED: blocked in the kernel (e.g. on I/O, or a futex);
- IDLE: is not blocked, but cannot be scheduled by the kernel to run
  because it has no server assigned to it (e.g. because all available
  servers are busy "running" other workers).
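To illustrate how a task state and its flags can share one word, here is
a minimal sketch in C. It is not the layout used by the patch: the
numeric values, the 8-bit state mask, the EX_ prefixed names and the
helper functions are all assumptions made for this example. Only the
state names and UMCG_TF_PREEMPTED come from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding: values are assumptions, not taken from the patch. */
enum ex_umcg_state {
	EX_NONE    = 0,
	EX_RUNNING = 1,
	EX_BLOCKED = 2,
	EX_IDLE    = 3,
};

/* Low bits hold the state; flags live above the mask (assumed layout). */
#define EX_STATE_MASK      0xffu
#define EX_TF_LOCKED       (1u << 8)	/* hypothetical name and bit */
#define UMCG_TF_PREEMPTED  (1u << 9)	/* name from the text, bit assumed */

/* Return the state with all flags stripped. */
static inline uint32_t ex_base_state(uint32_t word)
{
	return word & EX_STATE_MASK;
}

/* Test whether a given flag is ORed into the state word. */
static inline bool ex_has_flag(uint32_t word, uint32_t flag)
{
	return (word & flag) != 0;
}

/* Replace the state while keeping any flags that are already set. */
static inline uint32_t ex_set_state(uint32_t word, uint32_t state)
{
	return (word & ~EX_STATE_MASK) | (state & EX_STATE_MASK);
}
```

With this layout, EX_RUNNING | UMCG_TF_PREEMPTED describes a running
worker that userspace has tagged for preemption, and ex_base_state()
still reports EX_RUNNING for it.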

Usually the number of servers in a process is equal to the number of
CPUs available to the kernel if the process is supposed to consume the
whole machine, or fewer than the number of CPUs available if the
process is sharing the machine with other workloads. The number of
workers in a process can grow very large: tens of thousands is normal;
hundreds of thousands and more (millions) is something that would be
desirable to achieve in the future, as lightweight userspace threads in
Java and Go easily scale to millions, and UMCG workers are (intended to
be) conceptually similar to those.

Detailed use cases and API behavior are provided in
Documentation/userspace-api/umcg.[txt|rst] (see sibling patches).

Some high-level implementation notes:

UMCG tasks (workers and servers) are "tagged" with struct umcg_task
residing in userspace (usually in TLS) to facilitate kernel/userspace
communication. This makes the kernel-side code much simpler (see e.g.
the implementation of sys_umcg_wait), but also requires some careful
uaccess handling and page pinning (see below).

The main UMCG server/worker interaction looks like:

a. worker W1 is RUNNING, with a server S attached to it sleeping in
   IDLE state;
b. worker W1 blocks in the kernel, e.g. on I/O;
c. the kernel marks W1 as BLOCKED, the attached server S as RUNNING,
   and wakes S (the "block detection" event);
d. the server now picks another IDLE worker W2 to run: marks W2 as
   RUNNING, itself as IDLE, and calls sys_umcg_wait();
e. when the blocking operation of W1 completes, the worker is marked by
   the kernel as IDLE and added to the idle workers list (see struct
   umcg_task) for the userspace to pick up and later run (the "wake
   detection" event).

While there are additional operations such as worker-to-worker context
switch, preemption, workers "yielding", etc., the "workflow" above is
the main worker/server interaction that drives the implementation.
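
Steps a-e above can be modeled as a small state machine. The sketch
below is a single-threaded userspace model only: struct ex_task, struct
ex_group and all ex_ functions are hypothetical names invented for this
example, and no syscalls, sleeping or waking actually happen. It only
tracks which state each task would be in after each event.

```c
#include <assert.h>
#include <stddef.h>

enum ex_state { EX_RUNNING, EX_BLOCKED, EX_IDLE };

/* Hypothetical stand-in for the per-task data; not struct umcg_task. */
struct ex_task {
	enum ex_state state;
	struct ex_task *peer;		/* worker's server, or server's worker */
	struct ex_task *idle_next;	/* link in the idle workers list */
};

struct ex_group {
	struct ex_task *idle_workers;	/* workers woken but not yet run */
};

/* Steps a/d: a running server hands its CPU to an idle worker. */
static void ex_server_run_worker(struct ex_task *server, struct ex_task *worker)
{
	assert(server->state == EX_RUNNING && worker->state == EX_IDLE);
	worker->state = EX_RUNNING;
	worker->peer = server;
	server->peer = worker;
	server->state = EX_IDLE;	/* real code would now call sys_umcg_wait() */
}

/* Steps b-c, block detection: returns the server that would be woken. */
static struct ex_task *ex_worker_blocks(struct ex_task *worker)
{
	struct ex_task *server = worker->peer;

	assert(worker->state == EX_RUNNING && server != NULL);
	worker->state = EX_BLOCKED;
	worker->peer = NULL;
	server->peer = NULL;
	server->state = EX_RUNNING;
	return server;
}

/* Step e, wake detection: the worker becomes IDLE and is queued. */
static void ex_worker_unblocks(struct ex_group *g, struct ex_task *worker)
{
	assert(worker->state == EX_BLOCKED);
	worker->state = EX_IDLE;
	worker->idle_next = g->idle_workers;
	g->idle_workers = worker;
}

/* Userspace side: take one worker off the idle list, or NULL if empty. */
static struct ex_task *ex_pop_idle(struct ex_group *g)
{
	struct ex_task *w = g->idle_workers;

	if (w) {
		g->idle_workers = w->idle_next;
		w->idle_next = NULL;
	}
	return w;
}
```

In this model a server calls ex_server_run_worker() for W1, W1 later
goes through ex_worker_blocks(), the returned server runs W2, and W1
reappears through ex_pop_idle() once ex_worker_unblocks() has queued it.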

Specifically:

- most operations are conceptually context switches:
  - scheduling a worker: a running server goes to sleep and "runs" a
    worker in its place;
  - block detection: worker is descheduled, and its server is woken;
  - wake detection: woken worker, running in the kernel, is
    descheduled, and if there is an idle server, it is woken to process
    the wake detection event;

- to facilitate low scheduling latencies and cache locality, most
  server/worker interactions described above are performed
  synchronously "on CPU" via the WF_CURRENT_CPU flag passed to ttwu;
  while at the moment the context switches are simulated by putting the
  switch-out task to sleep and waking the switch-into task on the same
  CPU, it is very much the long-term goal of this project to make the
  context switch much lighter, by tweaking runtime accounting and,
  maybe, even bypassing __schedule();

- worker blocking is detected in a hook to sched_submit_work; as
  mentioned above, the server is to be woken on the same CPU,
  synchronously; this code may not pagefault, so to access the worker's
  and the server's userspace memory (struct umcg_task), memory pages
  containing the worker's and the server's structs umcg_task are pinned
  when the worker is exiting to the userspace, and unpinned when the
  worker is descheduled;

- worker wakeup is detected in a hook to sched_update_worker, and
  processed in the exit to usermode loop (via TIF_NOTIFY_RESUME);
  workers CAN pagefault on the wakeup path;

- worker preemption is implemented by the userspace tagging the worker
  with the UMCG_TF_PREEMPTED state flag and sending a NOOP signal to
  it; on the exit to usermode the worker is intercepted and its server
  is woken (see Documentation/userspace-api/umcg.[txt|rst] for more
  details);

- each state change is tagged with a unique timestamp (of MONOTONIC
  variety), so that
  - scheduling instrumentation is naturally available;
  - racing state changes are easily detected and ABA issues are
    avoided;
  see umcg_update_state() in umcg.c for implementation details, and
  Documentation/userspace-api/umcg.[txt|rst] for a higher-level
  description.

The previous version of the patchset can be found at
https://lore.kernel.org/all/20210917180323.278250-1-posk@google.com/
containing some additional context and links to earlier discussions.

More details are available in
Documentation/userspace-api/umcg.[txt|rst] in sibling patches, and in
doc-comments in the code.

Signed-off-by: Peter Oskolkov <posk@google.com>