sched/umcg: implement UMCG syscalls
Define struct umcg_task and two syscalls: sys_umcg_ctl and sys_umcg_wait.
User Managed Concurrency Groups is an M:N threading toolkit that allows
constructing user space schedulers designed to efficiently manage
heterogeneous in-process workloads while maintaining high CPU
utilization (95%+).
In addition, M:N threading and cooperative user space scheduling
enable a synchronous coding style and better cache locality compared
to the asynchronous callback/continuation style of programming.
The UMCG kernel API is built around the following ideas:
* UMCG server: a task/thread representing "kernel threads", or (v)CPUs;
* UMCG worker: a task/thread representing "application threads", to be
scheduled over servers;
* UMCG task state: (NONE), RUNNING, BLOCKED, IDLE: states a UMCG task (a
server or a worker) can be in;
* UMCG task state flag: LOCKED, PREEMPTED: additional state flags that
can be ORed with the task state to communicate additional information to
the kernel;
* struct umcg_task: a per-task userspace set of data fields, usually
residing in the TLS, that fully reflects the current task's UMCG state
and controls the way the kernel manages the task;
* sys_umcg_ctl(): a syscall used to register the current task/thread as a
server or a worker, or to unregister a UMCG task;
* sys_umcg_wait(): a syscall used to put the current task to sleep and/or
wake another task, potentially context-switching between the two tasks
on-CPU synchronously.
In short, servers can be thought of as CPUs over which application
threads (workers) are scheduled; at any one time a worker is either:
- RUNNING: has a server and is schedulable by the kernel;
- BLOCKED: blocked in the kernel (e.g. on I/O, or a futex);
- IDLE: is not blocked, but cannot be scheduled by the kernel to
run because it has no server assigned to it (e.g. because all
available servers are busy "running" other workers).
Usually the number of servers in a process is equal to the number of
CPUs available to the kernel if the process is supposed to consume
the whole machine, or less than the number of CPUs available if the
process is sharing the machine with other workloads. The number of
workers in a process can grow very large: tens of thousands is normal;
hundreds of thousands and more (millions) is something that would
be desirable to achieve in the future, as lightweight userspace
threads in Java and Go easily scale to millions, and UMCG workers
are (intended to be) conceptually similar to those.
Detailed use cases and API behavior are provided in
Documentation/userspace-api/umcg.[txt|rst] (see sibling patches).
Some high-level implementation notes:
UMCG tasks (workers and servers) are "tagged" with struct umcg_task
residing in userspace (usually in TLS) to facilitate kernel/userspace
communication. This makes the kernel-side code much simpler (see e.g.
the implementation of sys_umcg_wait), but also requires some careful
uaccess handling and page pinning (see below).
The main UMCG server/worker interaction looks like:
a. worker W1 is RUNNING, with a server S attached to it sleeping
in IDLE state;
b. worker W1 blocks in the kernel, e.g. on I/O;
c. the kernel marks W1 as BLOCKED, the attached server S
as RUNNING, and wakes S (the "block detection" event);
d. the server now picks another IDLE worker W2 to run: marks
W2 as RUNNING, itself as IDLE, and calls sys_umcg_wait();
e. when the blocking operation of W1 completes, the worker
is marked by the kernel as IDLE and added to idle workers list
(see struct umcg_task) for the userspace to pick up and
later run (the "wake detection" event).
While there are additional operations such as worker-to-worker
context switch, preemption, workers "yielding", etc., the "workflow"
above is the main worker/server interaction that drives the
implementation.
Specifically:
- most operations are conceptually context switches:
- scheduling a worker: a running server goes to sleep and "runs"
a worker in its place;
- block detection: worker is descheduled, and its server is woken;
- wake detection: woken worker, running in the kernel, is descheduled,
and if there is an idle server, it is woken to process the wake
detection event;
- to facilitate low scheduling latencies and cache locality, most
server/worker interactions described above are performed synchronously
"on CPU" via WF_CURRENT_CPU flag passed to ttwu; while at the moment
the context switches are simulated by putting the switch-out task to
sleep and waking the switch-into task on the same cpu, it is very much
the long-term goal of this project to make the context switch much
lighter, by tweaking runtime accounting and, maybe, even bypassing
__schedule();
- worker blocking is detected in a hook to sched_submit_work; as mentioned
above, the server is to be woken on the same CPU, synchronously;
this code may not pagefault, so to access worker's and server's
userspace memory (struct umcg_task), memory pages containing the worker's
and the server's structs umcg_task are pinned when the worker is
exiting to the userspace, and unpinned when the worker is descheduled;
- worker wakeup is detected in a hook to sched_update_worker, and processed
in the exit to usermode loop (via TIF_NOTIFY_RESUME); workers CAN
pagefault on the wakeup path;
- worker preemption is implemented by the userspace tagging the worker
with UMCG_TF_PREEMPTED state flag and sending a NOOP signal to it;
on the exit to usermode the worker is intercepted and its server is woken
(see Documentation/userspace-api/umcg.[txt|rst] for more details);
- each state change is tagged with a unique timestamp (of MONOTONIC
variety), so that
- scheduling instrumentation is naturally available;
- racing state changes are easily detected and ABA issues are
avoided;
see umcg_update_state() in umcg.c for implementation details, and
Documentation/userspace-api/umcg.[txt|rst] for a higher-level
description.
The previous version of the patchset can be found at
https://lore.kernel.org/all/20210917180323.278250-1-posk@google.com/
containing some additional context and links to earlier discussions.
More details are available in Documentation/userspace-api/umcg.[txt|rst]
in sibling patches, and in doc-comments in the code.
Signed-off-by: Peter Oskolkov <posk@google.com>