sched/umcg: implement UMCG syscalls

Define struct umcg_task and two syscalls: sys_umcg_ctl and sys_umcg_wait.

User Managed Concurrency Groups is an M:N threading toolkit that allows
constructing user space schedulers designed to efficiently manage
heterogeneous in-process workloads while maintaining high CPU
utilization (95%+).

In addition, M:N threading and cooperative user space scheduling
enable a synchronous coding style and better cache locality compared
to the asynchronous callback/continuation style of programming.

The UMCG kernel API is built around the following ideas:

* UMCG server: a task/thread representing "kernel threads", or (v)CPUs;
* UMCG worker: a task/thread representing "application threads", to be
  scheduled over servers;
* UMCG task state: (NONE), RUNNING, BLOCKED, IDLE: the states a UMCG
  task (a server or a worker) can be in;
* UMCG task state flags: LOCKED, PREEMPTED: flags that can be ORed with
  the task state to communicate additional information to the kernel;
* struct umcg_task: a per-task set of userspace data fields, usually
  residing in TLS, that fully reflects the current task's UMCG state
  and controls the way the kernel manages the task;
* sys_umcg_ctl(): a syscall used to register the current task/thread as a
  server or a worker, or to unregister a UMCG task (a registration sketch
  follows this list);
* sys_umcg_wait(): a syscall used to put the current task to sleep and/or
  wake another task, potentially context-switching between the two tasks
  on-CPU synchronously.
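
For illustration, registration might look like the following userspace
sketch. This is not part of the patch: the helper name is made up, and
the assumption that tasks register in the RUNNING state should be
checked against the Documentation patch.

    /* Assumes <stdint.h>, <unistd.h>, <sys/syscall.h>, <linux/umcg.h>;
     * __NR_umcg_ctl per the syscall table below. */
    static __thread struct umcg_task umcg_self
            __attribute__((aligned(64)));   /* one cache line, per the uapi */

    static long umcg_register(int is_worker)
    {
            uint32_t flags = UMCG_CTL_REGISTER |
                             (is_worker ? UMCG_CTL_WORKER : 0);

            /* Assumed: tasks register in the RUNNING state; the timestamp
             * bits of state_ts are omitted here for brevity. */
            umcg_self.state_ts = UMCG_TASK_RUNNING;
            return syscall(__NR_umcg_ctl, flags, &umcg_self);
    }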

In short, servers can be thought of as CPUs over which application
threads (workers) are scheduled; at any one time a worker is either:
- RUNNING: has a server and is schedulable by the kernel;
- BLOCKED: blocked in the kernel (e.g. on I/O, or a futex);
- IDLE: is not blocked, but cannot be scheduled by the kernel to
  run because it has no server assigned to it (e.g. because all
  available servers are busy "running" other workers).
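
In code terms, a task's current state is the low six bits of its
struct umcg_task (a sketch using the masks defined in the uapi header
added below):

    static inline uint64_t umcg_task_state(struct umcg_task *ut)
    {
            /* Strip the state flags and the timestamp; bits 0-5 remain. */
            return __atomic_load_n(&ut->state_ts, __ATOMIC_ACQUIRE) &
                   UMCG_TASK_STATE_MASK;
    }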

Usually the number of servers in a process equals the number of CPUs
available to the kernel if the process is meant to consume the whole
machine, or fewer if the process shares the machine with other
workloads. The number of
workers in a process can grow very large: tens of thousands is normal;
hundreds of thousands and more (millions) is something that would
be desirable to achieve in the future, as lightweight userspace
threads in Java and Go easily scale to millions, and UMCG workers
are (intended to be) conceptually similar to those.

Detailed use cases and API behavior are provided in
Documentation/userspace-api/umcg.[txt|rst] (see sibling patches).

Some high-level implementation notes:

UMCG tasks (workers and servers) are "tagged" with struct umcg_task
residing in userspace (usually in TLS) to facilitate kernel/userspace
communication. This makes the kernel-side code much simpler (see e.g.
the implementation of sys_umcg_wait), but also requires some careful
uaccess handling and page pinning (see below).

The main UMCG server/worker interaction looks like:

a. worker W1 is RUNNING, with a server S attached to it sleeping
   in IDLE state;
b. worker W1 blocks in the kernel, e.g. on I/O;
c. the kernel marks W1 as BLOCKED, the attached server S
   as RUNNING, and wakes S (the "block detection" event);
d. the server now picks another IDLE worker W2 to run: it marks
   W2 as RUNNING, itself as IDLE, and calls sys_umcg_wait()
   (sketched after this list);
e. when the blocking operation of W1 completes, the worker
   is marked by the kernel as IDLE and added to the idle workers
   list (see struct umcg_task) for userspace to pick up and
   later run (the "wake detection" event).
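
A sketch of step (d), the central scheduling operation, follows.
umcg_set_state() stands in for the timestamped state update described
further below; the function name and error handling are illustrative
assumptions, not part of the patch.

    static void server_run_worker(struct umcg_task *server,
                                  struct umcg_task *worker,
                                  uint32_t worker_tid)
    {
            umcg_set_state(worker, UMCG_TASK_RUNNING); /* W2 becomes runnable */
            umcg_set_state(server, UMCG_TASK_IDLE);    /* S goes to sleep */
            server->next_tid = worker_tid;             /* the switch-into task */

            /* Puts the server to sleep and wakes the worker,
             * ideally on the current CPU. */
            syscall(__NR_umcg_wait, 0, 0);
    }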

While there are additional operations such as worker-to-worker
context switch, preemption, workers "yielding", etc., the "workflow"
above is the main worker/server interaction that drives the
implementation.

Specifically:

- most operations are conceptually context switches:
    - scheduling a worker: a running server goes to sleep and "runs"
      a worker in its place;
    - block detection: worker is descheduled, and its server is woken;
    - wake detection: the woken worker, running in the kernel, is
      descheduled, and if there is an idle server, it is woken to
      process the wake detection event;
- to facilitate low scheduling latencies and cache locality, most
  server/worker interactions described above are performed synchronously
  "on CPU" via the WF_CURRENT_CPU flag passed to ttwu; while at the moment
  the context switches are simulated by putting the switched-out task to
  sleep and waking the switched-into task on the same CPU, it is very much
  the long-term goal of this project to make the context switch much
  lighter, by tweaking runtime accounting and, maybe, even bypassing
  __schedule();
- worker blocking is detected in a hook to sched_submit_work; as mentioned
  above, the server is to be woken on the same CPU, synchronously; this
  code may not pagefault, so to access the worker's and the server's
  userspace memory (struct umcg_task), the pages containing those structs
  are pinned when the worker is exiting to userspace and unpinned when
  the worker is descheduled;
- worker wakeup is detected in a hook to sched_update_worker, and processed
  in the exit to usermode loop (via TIF_NOTIFY_RESUME); workers CAN
  pagefault on the wakeup path;
- worker preemption is implemented by userspace tagging the worker
  with the UMCG_TF_PREEMPTED state flag and sending a NOOP signal to it;
  on the exit to usermode the worker is intercepted and its server is woken
  (see the sketch after this list and
  Documentation/userspace-api/umcg.[txt|rst] for more details);
- each state change is tagged with a unique CLOCK_MONOTONIC timestamp,
  so that
    - scheduling instrumentation is naturally available;
    - racing state changes are easily detected and ABA issues are
      avoided;
  see umcg_update_state() in umcg.c for implementation details, and
  Documentation/userspace-api/umcg.[txt|rst] for a higher-level
  description.
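
To make the last two items concrete, here is a hypothetical userspace
sketch. The bit layout is taken from include/uapi/linux/umcg.h; the
helper names, the retry policy, and the choice of signal are
assumptions, not part of the patch.

    /* Assumes <time.h>, <signal.h>, <unistd.h>, <sys/syscall.h>,
     * <linux/umcg.h>. */
    static uint64_t umcg_pack_state_ts(uint64_t state)
    {
            struct timespec ts;
            uint64_t ns;

            clock_gettime(CLOCK_MONOTONIC, &ts);
            ns = (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
            ns >>= UMCG_STATE_TIMESTAMP_GRANULARITY;  /* 16ns units */
            /* Timestamp in bits 18-63; state and flags in bits 0-12. */
            return (ns << (64 - UMCG_STATE_TIMESTAMP_BITS)) | state;
    }

    static void preempt_worker(struct umcg_task *worker, pid_t worker_tid)
    {
            uint64_t old, next;

            /* CAS: a racing state change carries a different timestamp,
             * so a stale 'old' fails the exchange and we retry; this is
             * the ABA avoidance described above. */
            do {
                    old = __atomic_load_n(&worker->state_ts, __ATOMIC_ACQUIRE);
                    next = umcg_pack_state_ts(
                            (old & UMCG_TASK_STATE_MASK_FULL) | UMCG_TF_PREEMPTED);
            } while (!__atomic_compare_exchange_n(&worker->state_ts, &old, next,
                            0, __ATOMIC_RELEASE, __ATOMIC_RELAXED));

            /* Any no-op signal works: it only forces the worker through
             * the exit-to-usermode path, where the kernel notices the
             * flag and wakes the worker's server. */
            syscall(SYS_tgkill, getpid(), worker_tid, SIGUSR1);
    }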

The previous version of the patchset can be found at
https://lore.kernel.org/all/20210917180323.278250-1-posk@google.com/
and contains some additional context and links to earlier discussions.

More details are available in Documentation/userspace-api/umcg.[txt|rst]
in sibling patches, and in doc-comments in the code.

Signed-off-by: Peter Oskolkov <posk@google.com>
posk-io authored and intel-lab-lkp committed Oct 12, 2021
1 parent 904d0dd commit 0dcffc8
Showing 13 changed files with 1,175 additions and 4 deletions.
2 changes: 2 additions & 0 deletions arch/x86/entry/syscalls/syscall_64.tbl
@@ -370,6 +370,8 @@
446 common landlock_restrict_self sys_landlock_restrict_self
447 common memfd_secret sys_memfd_secret
448 common process_mrelease sys_process_mrelease
449 common umcg_ctl sys_umcg_ctl
450 common umcg_wait sys_umcg_wait

#
# Due to a historical design error, certain syscalls are numbered differently
1 change: 1 addition & 0 deletions fs/exec.c
@@ -1840,6 +1840,7 @@ static int bprm_execve(struct linux_binprm *bprm,
current->fs->in_exec = 0;
current->in_execve = 0;
rseq_execve(current);
umcg_execve(current);
acct_update_integrals(current);
task_numa_free(current, false);
return retval;
71 changes: 71 additions & 0 deletions include/linux/sched.h
@@ -67,6 +67,7 @@ struct sighand_struct;
struct signal_struct;
struct task_delay_info;
struct task_group;
struct umcg_task;

/*
* Task state bitmask. NOTE! These bits are also
@@ -1296,6 +1297,12 @@ struct task_struct {
unsigned long rseq_event_mask;
#endif

#ifdef CONFIG_UMCG
struct umcg_task __user *umcg_task;
struct page *pinned_umcg_worker_page; /* self */
struct page *pinned_umcg_server_page;
#endif

struct tlbflush_unmap_batch tlb_ubc;

union {
@@ -1688,6 +1695,13 @@ extern struct pid *cad_pid;
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */

#ifdef CONFIG_UMCG
#define PF_UMCG_WORKER 0x01000000 /* UMCG worker */
#else
#define PF_UMCG_WORKER 0x00000000
#endif

#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
#define PF_MEMALLOC_PIN 0x10000000 /* Allocation context constrained to zones which allow long term pinning. */
@@ -2275,6 +2289,63 @@ static inline void rseq_execve(struct task_struct *t)

#endif

#ifdef CONFIG_UMCG

void umcg_handle_resuming_worker(void);
void umcg_handle_exiting_worker(void);
void umcg_clear_child(struct task_struct *tsk);

/* Called by bprm_execve() in fs/exec.c. */
static inline void umcg_execve(struct task_struct *tsk)
{
if (tsk->umcg_task)
umcg_clear_child(tsk);
}

/* Called by exit_to_user_mode_loop() in kernel/entry/common.c.*/
static inline void umcg_handle_notify_resume(void)
{
if (current->flags & PF_UMCG_WORKER)
umcg_handle_resuming_worker();
}

/* Called by do_exit() in kernel/exit.c. */
static inline void umcg_handle_exit(void)
{
if (current->flags & PF_UMCG_WORKER)
umcg_handle_exiting_worker();
}

/*
* umcg_wq_worker_[sleeping|running] are called in core.c by
* sched_submit_work() and sched_update_worker().
*/
void umcg_wq_worker_sleeping(struct task_struct *tsk);
void umcg_wq_worker_running(struct task_struct *tsk);

#else /* CONFIG_UMCG */

static inline void umcg_clear_child(struct task_struct *tsk)
{
}
static inline void umcg_execve(struct task_struct *tsk)
{
}
static inline void umcg_handle_notify_resume(void)
{
}
static inline void umcg_handle_exit(void)
{
}
static inline void umcg_wq_worker_sleeping(struct task_struct *tsk)
{
}
static inline void umcg_wq_worker_running(struct task_struct *tsk)
{
}

#endif

#ifdef CONFIG_DEBUG_RSEQ

void rseq_syscall(struct pt_regs *regs);
3 changes: 3 additions & 0 deletions include/linux/syscalls.h
@@ -71,6 +71,7 @@ struct open_how;
struct mount_attr;
struct landlock_ruleset_attr;
enum landlock_rule_type;
struct umcg_task;

#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -1052,6 +1053,8 @@ asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type rule_type,
const void __user *rule_attr, __u32 flags);
asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
asmlinkage long sys_memfd_secret(unsigned int flags);
asmlinkage long sys_umcg_ctl(u32 flags, struct umcg_task __user *self);
asmlinkage long sys_umcg_wait(u32 flags, u64 abs_timeout);

/*
* Architecture-specific system calls
6 changes: 5 additions & 1 deletion include/uapi/asm-generic/unistd.h
@@ -879,9 +879,13 @@ __SYSCALL(__NR_memfd_secret, sys_memfd_secret)
#endif
#define __NR_process_mrelease 448
__SYSCALL(__NR_process_mrelease, sys_process_mrelease)
#define __NR_umcg_ctl 449
__SYSCALL(__NR_umcg_ctl, sys_umcg_ctl)
#define __NR_umcg_wait 450
__SYSCALL(__NR_umcg_wait, sys_umcg_wait)

#undef __NR_syscalls
#define __NR_syscalls 449
#define __NR_syscalls 451

/*
* 32 bit systems traditionally used different
137 changes: 137 additions & 0 deletions include/uapi/linux/umcg.h
@@ -0,0 +1,137 @@
/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
#ifndef _UAPI_LINUX_UMCG_H
#define _UAPI_LINUX_UMCG_H

#include <linux/limits.h>
#include <linux/types.h>

/*
* UMCG: User Managed Concurrency Groups.
*
* Syscalls (see kernel/sched/umcg.c):
* sys_umcg_ctl() - register/unregister UMCG tasks;
* sys_umcg_wait() - wait/wake/context-switch.
*
* struct umcg_task (below): controls the state of UMCG tasks.
*
* See Documentation/userspace-api/umcg.[txt|rst] for details.
*/

/*
* UMCG task states, the first 6 bits of struct umcg_task.state_ts.
* The states represent the user space point of view.
*/
#define UMCG_TASK_NONE 0ULL
#define UMCG_TASK_RUNNING 1ULL
#define UMCG_TASK_IDLE 2ULL
#define UMCG_TASK_BLOCKED 3ULL

/* UMCG task state flags, bits 7-8 */

/*
* UMCG_TF_LOCKED: locked by the userspace in preparation to calling umcg_wait.
*/
#define UMCG_TF_LOCKED (1ULL << 6)

/*
* UMCG_TF_PREEMPTED: the userspace indicates the worker should be preempted.
*/
#define UMCG_TF_PREEMPTED (1ULL << 7)

/* The first six bits: RUNNING, IDLE, or BLOCKED. */
#define UMCG_TASK_STATE_MASK 0x3fULL

/* The full kernel state mask: the first 13 bits. */
#define UMCG_TASK_STATE_MASK_FULL 0x1fffULL

/*
* The number of bits reserved for UMCG state timestamp in
* struct umcg_task.state_ts.
*/
#define UMCG_STATE_TIMESTAMP_BITS 46

/* The number of bits truncated from UMCG state timestamp. */
#define UMCG_STATE_TIMESTAMP_GRANULARITY 4

/**
* struct umcg_task - controls the state of UMCG tasks.
*
* The struct is aligned at 64 bytes to ensure that it fits into
* a single cache line.
*/
struct umcg_task {
/**
* @state_ts: the current state of the UMCG task described by
* this struct, with a unique timestamp indicating
* when the last state change happened.
*
* Readable/writable by both the kernel and the userspace.
*
* UMCG task state:
* bits 0 - 5: task state;
* bits 6 - 7: state flags;
* bits 8 - 12: reserved; must be zeroes;
* bits 13 - 17: for userspace use;
* bits 18 - 63: timestamp (see below).
*
* Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
* See Documentation/userspace-api/umcg.[txt|rst] for details.
*/
uint64_t state_ts; /* r/w */

/**
* @next_tid: the TID of the UMCG task that should be context-switched
* into in sys_umcg_wait(). Can be zero.
*
* Running UMCG workers must have next_tid pointing to an IDLE
* UMCG server.
*
* Read-only for the kernel, read/write for the userspace.
*/
uint32_t next_tid; /* r */

uint32_t flags; /* Reserved; must be zero. */

/**
* @idle_workers_ptr: a singly-linked list of idle workers. Can be NULL.
*
* Readable/writable by both the kernel and the userspace: the
* kernel adds items to the list, the userspace removes them.
*/
uint64_t idle_workers_ptr; /* r/w */

/**
* @idle_server_tid_ptr: a pointer to a single idle server.
* Readonly.
*/
uint64_t idle_server_tid_ptr; /* r */
} __attribute__((packed, aligned(8 * sizeof(__u64))));

/**
* enum umcg_ctl_flag - flags to pass to sys_umcg_ctl
* @UMCG_CTL_REGISTER: register the current task as a UMCG task
* @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task
* @UMCG_CTL_WORKER: register the current task as a UMCG worker
*/
enum umcg_ctl_flag {
UMCG_CTL_REGISTER = 0x00001,
UMCG_CTL_UNREGISTER = 0x00002,
UMCG_CTL_WORKER = 0x10000,
};

/**
* enum umcg_wait_flag - flags to pass to sys_umcg_wait
* @UMCG_WAIT_WAKE_ONLY: wake @self->next_tid, don't put @self to sleep;
* @UMCG_WAIT_WF_CURRENT_CPU: wake @self->next_tid on the current CPU
* (use WF_CURRENT_CPU); @UMCG_WAIT_WAKE_ONLY
* must be set.
*/
enum umcg_wait_flag {
UMCG_WAIT_WAKE_ONLY = 1,
UMCG_WAIT_WF_CURRENT_CPU = 2,
};

/* See Documentation/userspace-api/umcg.[txt|rst].*/
#define UMCG_IDLE_NODE_PENDING (1ULL)

#endif /* _UAPI_LINUX_UMCG_H */
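
As a usage illustration of the wait flags above (a hypothetical sketch,
not part of the patch; the helper name is made up):

    /* Wake the task whose TID is placed in self->next_tid without
     * putting the caller to sleep, preferring the current CPU, per
     * enum umcg_wait_flag above. */
    static long umcg_wake_next(struct umcg_task *self, uint32_t target_tid)
    {
            self->next_tid = target_tid;
            return syscall(__NR_umcg_wait,
                           UMCG_WAIT_WAKE_ONLY | UMCG_WAIT_WF_CURRENT_CPU, 0);
    }
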
10 changes: 10 additions & 0 deletions init/Kconfig
@@ -1688,6 +1688,16 @@ config MEMBARRIER

If unsure, say Y.

config UMCG
bool "Enable User Managed Concurrency Groups API"
depends on X86_64
default n
help
Enable User Managed Concurrency Groups API, which forms the basis
for an in-process M:N userspace scheduling framework.
At the moment this is an experimental/RFC feature that is not
guaranteed to be backward-compatible.

config KALLSYMS
bool "Load all symbols for debugging/ksymoops" if EXPERT
default y
4 changes: 3 additions & 1 deletion kernel/entry/common.c
@@ -171,8 +171,10 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
handle_signal_work(regs, ti_work);

if (ti_work & _TIF_NOTIFY_RESUME)
if (ti_work & _TIF_NOTIFY_RESUME) {
umcg_handle_notify_resume();
tracehook_notify_resume(regs);
}

/* Architecture specific TIF work */
arch_exit_to_user_mode_work(regs, ti_work);
5 changes: 5 additions & 0 deletions kernel/exit.c
@@ -745,6 +745,10 @@ void __noreturn do_exit(long code)
if (unlikely(!tsk->pid))
panic("Attempted to kill the idle task!");

/* Turn off UMCG sched hooks. */
if (unlikely(tsk->flags & PF_UMCG_WORKER))
tsk->flags &= ~PF_UMCG_WORKER;

/*
* If do_exit is called because this processes oopsed, it's possible
* that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
@@ -781,6 +785,7 @@

io_uring_files_cancel();
exit_signals(tsk); /* sets PF_EXITING */
umcg_handle_exit();

/* sync mm's RSS info before statistics gathering */
if (tsk->mm)
1 change: 1 addition & 0 deletions kernel/sched/Makefile
@@ -37,3 +37,4 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o
obj-$(CONFIG_CPU_ISOLATION) += isolation.o
obj-$(CONFIG_PSI) += psi.o
obj-$(CONFIG_SCHED_CORE) += core_sched.o
obj-$(CONFIG_UMCG) += umcg.o
9 changes: 7 additions & 2 deletions kernel/sched/core.c
@@ -4236,6 +4236,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->wake_entry.u_flags = CSD_TYPE_TTWU;
p->migration_pending = NULL;
#endif
umcg_clear_child(p);
}

DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
@@ -6265,9 +6266,11 @@ static inline void sched_submit_work(struct task_struct *tsk)
* If a worker goes to sleep, notify and ask workqueue whether it
* wants to wake up a task to maintain concurrency.
*/
if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
if (task_flags & PF_WQ_WORKER)
wq_worker_sleeping(tsk);
else if (task_flags & PF_UMCG_WORKER)
umcg_wq_worker_sleeping(tsk);
else
io_wq_worker_sleeping(tsk);
}
@@ -6285,9 +6288,11 @@ static inline void sched_submit_work(struct task_struct *tsk)

static void sched_update_worker(struct task_struct *tsk)
{
if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
if (tsk->flags & PF_WQ_WORKER)
wq_worker_running(tsk);
else if (tsk->flags & PF_UMCG_WORKER)
umcg_wq_worker_running(tsk);
else
io_wq_worker_running(tsk);
}