cgroup: Implement cgroup2 basic CPU usage accounting
In cgroup1, while cpuacct isn't actually controlling any resources, it
is a separate controller due to a combination of two factors:
1. enabling the cpu controller has significant side effects, and 2. we
have to pick one of the hierarchies to account CPU usage on.  The
cpuacct controller is effectively used to designate a hierarchy to
track CPU usage on.

cgroup2's unified hierarchy removes the second reason and we can
account basic CPU usage by default.  While we could use cpuacct for
this purpose, both its interface and implementation leave a lot to be
desired: it collects and exposes two sources of truth which don't
agree with each other, and some of the exposed statistics don't make
much sense.  Also, it propagates all the way up the hierarchy on each
accounting event, which is unnecessary.

This patch adds a basic resource accounting mechanism to cgroup2's
unified hierarchy and accounts CPU usage using it.

* All accounting is done per-cpu and isn't propagated immediately.
  An update just bumps the per-cgroup per-cpu counters and links the
  cgroup to its parent's updated list if it isn't already on it.

* On a read, the per-cpu counters are collected into the global ones
  and then propagated upwards.  Only the per-cpu counters which have
  changed since the last read are propagated.

* CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
  prefix.  Total usage is collected from scheduling events.  User/sys
  breakdown is sourced from tick sampling and adjusted to the usage
  using cputime_adjust().

This keeps the accounting side hot path O(1) and per-cpu and the read
side O(nr_updated_since_last_read).
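
The split between a cheap accounting path and a lazy read path can be
illustrated with a small user-space sketch.  This is not the code from
stat.c: `struct cg`, `cg_account()` and `cg_flush()` are hypothetical
names, locking is omitted, and for brevity the flush walks every child
instead of only the per-cpu updated list, so it shows the delta
propagation but not the O(nr_updated_since_last_read) bound.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 4
#define MAX_CHILDREN 4

/* Hypothetical, simplified stand-in for a cgroup; not the kernel struct. */
struct cg {
	struct cg *parent;
	struct cg *children[MAX_CHILDREN];
	int nr_children;
	uint64_t cur[NR_CPUS];	/* per-cpu counters bumped on the hot path */
	uint64_t last[NR_CPUS];	/* per-cpu snapshots taken at the last flush */
	uint64_t pending;	/* deltas handed up by children, not yet folded in */
	uint64_t total;		/* subtree total, current only after a flush */
};

static void cg_add_child(struct cg *parent, struct cg *child)
{
	assert(parent->nr_children < MAX_CHILDREN);
	child->parent = parent;
	parent->children[parent->nr_children++] = child;
}

/* Accounting side: O(1), touches only this cgroup's counter for @cpu. */
static void cg_account(struct cg *cg, int cpu, uint64_t delta)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	cg->cur[cpu] += delta;
}

/*
 * Read side: flush the children first so their deltas land in ->pending,
 * then fold the per-cpu deltas and ->pending into ->total and hand the
 * same delta up to the parent.
 */
static void cg_flush(struct cg *cg)
{
	uint64_t delta = 0;
	int i, cpu;

	for (i = 0; i < cg->nr_children; i++)
		cg_flush(cg->children[i]);

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		delta += cg->cur[cpu] - cg->last[cpu];
		cg->last[cpu] = cg->cur[cpu];
	}

	delta += cg->pending;
	cg->pending = 0;
	cg->total += delta;
	if (cg->parent)
		cg->parent->pending += delta;
}

/* Usage: root <- a <- b; returns the root's subtree total (175). */
static uint64_t cg_demo(void)
{
	struct cg root = {0}, a = {0}, b = {0};

	cg_add_child(&root, &a);
	cg_add_child(&a, &b);
	cg_account(&b, 0, 100);
	cg_account(&b, 1, 50);
	cg_account(&a, 2, 25);
	cg_flush(&root);
	return root.total;
}
```

Because each flush only moves the difference between `cur` and `last`,
flushing twice in a row adds nothing the second time, and a flush of an
intermediate cgroup leaves its delta parked in the parent's `pending`
until the parent is read.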

v2: Minor changes and documentation updates as suggested by Waiman and
    Roman.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
htejun committed Sep 25, 2017
1 parent d2cc5ed commit 041cd64
Showing 7 changed files with 453 additions and 3 deletions.
9 changes: 9 additions & 0 deletions Documentation/cgroup-v2.txt
@@ -886,6 +886,15 @@ All cgroup core files are prefixed with "cgroup."
		A dying cgroup can consume system resources not exceeding
		limits, which were active at the moment of cgroup deletion.

	  cpu.usage_usec
		CPU time consumed in the subtree.

	  cpu.user_usec
		User CPU time consumed in the subtree.

	  cpu.system_usec
		System CPU time consumed in the subtree.


Controllers
===========
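
The commit message says the user/system breakdown comes from tick
sampling and is adjusted to the precisely measured usage with
cputime_adjust().  The following is a simplified sketch of that kind of
proportional adjustment, not the kernel's implementation: `split_usage()`
and `struct cpu_split` are hypothetical names, and the monotonicity and
overflow handling of the real helper are left out (the sketch asserts
that its inputs are small enough instead).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
struct cpu_split {
	uint64_t user;
	uint64_t system;
};

/*
 * Scale tick-sampled user/system times so that they add up to the
 * precisely measured @usage, keeping their sampled ratio.
 */
static struct cpu_split split_usage(uint64_t usage, uint64_t user_ticks,
				    uint64_t sys_ticks)
{
	struct cpu_split out;

	if (sys_ticks == 0) {		/* no system samples: all user */
		out.user = usage;
		out.system = 0;
		return out;
	}
	if (user_ticks == 0) {		/* no user samples: all system */
		out.user = 0;
		out.system = usage;
		return out;
	}

	/* Sketch-only preconditions: the arithmetic below must not overflow. */
	assert(user_ticks <= UINT64_MAX - sys_ticks);
	assert(usage <= UINT64_MAX / sys_ticks);

	out.system = usage * sys_ticks / (user_ticks + sys_ticks);
	out.user = usage - out.system;
	return out;
}
```

For example, `split_usage(1000, 30, 10)` yields 750 user and 250 system:
the 3:1 sampled ratio is kept while the two parts add up to the measured
usage.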
57 changes: 57 additions & 0 deletions include/linux/cgroup-defs.h
@@ -16,6 +16,7 @@
#include <linux/refcount.h>
#include <linux/percpu-refcount.h>
#include <linux/percpu-rwsem.h>
#include <linux/u64_stats_sync.h>
#include <linux/workqueue.h>
#include <linux/bpf-cgroup.h>

@@ -254,6 +255,57 @@ struct css_set {
	struct rcu_head rcu_head;
};

/*
 * cgroup basic resource usage statistics. Accounting is done per-cpu in
 * cgroup_cpu_stat which is then lazily propagated up the hierarchy on
 * reads.
 *
 * When a stat gets updated, the cgroup_cpu_stat and its ancestors are
 * linked into the updated tree. On the following read, propagation only
 * considers and consumes the updated tree. This makes reading O(the
 * number of descendants which have been active since last read) instead of
 * O(the total number of descendants).
 *
 * This is important because there can be a lot of (draining) cgroups which
 * aren't active and stat may be read frequently. The combination can
 * become very expensive. By propagating selectively, increasing reading
 * frequency decreases the cost of each read.
 */
struct cgroup_cpu_stat {
	/*
	 * ->sync protects all the current counters. These are the only
	 * fields which get updated in the hot path.
	 */
	struct u64_stats_sync sync;
	struct task_cputime cputime;

	/*
	 * Snapshots at the last reading. These are used to calculate the
	 * deltas to propagate to the global counters.
	 */
	struct task_cputime last_cputime;

	/*
	 * Child cgroups with stat updates on this cpu since the last read
	 * are linked on the parent's ->updated_children through
	 * ->updated_next.
	 *
	 * In addition to being more compact, singly-linked list pointing
	 * to the cgroup makes it unnecessary for each per-cpu struct to
	 * point back to the associated cgroup.
	 *
	 * Protected by per-cpu cgroup_cpu_stat_lock.
	 */
	struct cgroup *updated_children;	/* terminated by self cgroup */
	struct cgroup *updated_next;		/* NULL iff not on the list */
};

struct cgroup_stat {
	/* per-cpu statistics are collected into the following global counters */
	struct task_cputime cputime;
	struct prev_cputime prev_cputime;
};

struct cgroup {
	/* self css with NULL ->ss, points back to this cgroup */
	struct cgroup_subsys_state self;
@@ -353,6 +405,11 @@ struct cgroup {
	 */
	struct cgroup *dom_cgrp;

	/* cgroup basic resource statistics */
	struct cgroup_cpu_stat __percpu *cpu_stat;
	struct cgroup_stat pending_stat;	/* pending from children */
	struct cgroup_stat stat;

	/*
	 * list of pidlists, up to two for each namespace (one for procs, one
	 * for tasks); created on demand.
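
The `updated_children` / `updated_next` comments above describe a
singly-linked list that is terminated by the parent itself, with a NULL
`updated_next` meaning "not on any list".  The sketch below illustrates
those two invariants for a single CPU.  It is an assumption-laden
illustration rather than the code added in stat.c: `struct node` and the
`node_*()` helpers are hypothetical, there is no locking, and the drain
order is a plain recursive walk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical single-cpu stand-in for a cgroup's updated-list linkage. */
struct node {
	struct node *parent;
	struct node *updated_children;	/* terminated by the node itself */
	struct node *updated_next;	/* NULL iff not on the parent's list */
};

static void node_init(struct node *n, struct node *parent)
{
	n->parent = parent;
	n->updated_children = n;	/* empty list: points at self */
	n->updated_next = NULL;
}

/*
 * Link @n and its ancestors onto the updated lists, bottom-up.  If a node
 * is already linked, all of its ancestors are too, so the walk stops.
 */
static void node_mark_updated(struct node *n)
{
	struct node *parent;

	for (parent = n->parent; parent; n = parent, parent = n->parent) {
		if (n->updated_next)
			break;
		n->updated_next = parent->updated_children;
		parent->updated_children = n;
	}
}

/*
 * Consume the updated tree below @n, restoring the "empty" state on the
 * way.  Returns the number of descendants visited, which depends only on
 * how many were marked, not on the size of the hierarchy.
 */
static int node_drain(struct node *n)
{
	struct node *child = n->updated_children;
	int visited = 0;

	n->updated_children = n;
	while (child != n) {
		struct node *next = child->updated_next;

		child->updated_next = NULL;
		visited += 1 + node_drain(child);
		child = next;
	}
	return visited;
}

/* Usage: root has children a and c, a has child b; only b is updated. */
static int node_demo(void)
{
	struct node root, a, b, c;

	node_init(&root, NULL);
	node_init(&a, &root);
	node_init(&b, &a);
	node_init(&c, &root);

	node_mark_updated(&b);
	node_mark_updated(&b);		/* already linked: no-op */
	return node_drain(&root);	/* visits a and b, never c */
}
```

Terminating the list with the parent rather than NULL is what lets
`updated_next` double as the "already linked" flag: even the last entry
on a list has a non-NULL `updated_next`.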
22 changes: 22 additions & 0 deletions include/linux/cgroup.h
@@ -703,17 +703,39 @@ static inline void cpuacct_account_field(struct task_struct *tsk, int index,
u64 val) {}
#endif

void cgroup_stat_show_cputime(struct seq_file *seq, const char *prefix);

void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec);
void __cgroup_account_cputime_field(struct cgroup *cgrp,
				    enum cpu_usage_stat index, u64 delta_exec);

static inline void cgroup_account_cputime(struct task_struct *task,
					  u64 delta_exec)
{
	struct cgroup *cgrp;

	cpuacct_charge(task, delta_exec);

	rcu_read_lock();
	cgrp = task_dfl_cgroup(task);
	if (cgroup_parent(cgrp))
		__cgroup_account_cputime(cgrp, delta_exec);
	rcu_read_unlock();
}

static inline void cgroup_account_cputime_field(struct task_struct *task,
						enum cpu_usage_stat index,
						u64 delta_exec)
{
	struct cgroup *cgrp;

	cpuacct_account_field(task, index, delta_exec);

	rcu_read_lock();
	cgrp = task_dfl_cgroup(task);
	if (cgroup_parent(cgrp))
		__cgroup_account_cputime_field(cgrp, index, delta_exec);
	rcu_read_unlock();
}

#else /* CONFIG_CGROUPS */
2 changes: 1 addition & 1 deletion kernel/cgroup/Makefile
@@ -1,4 +1,4 @@
-obj-y := cgroup.o namespace.o cgroup-v1.o
+obj-y := cgroup.o stat.o namespace.o cgroup-v1.o

obj-$(CONFIG_CGROUP_FREEZER) += freezer.o
obj-$(CONFIG_CGROUP_PIDS) += pids.o
8 changes: 8 additions & 0 deletions kernel/cgroup/cgroup-internal.h
@@ -199,6 +199,14 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,

int cgroup_task_count(const struct cgroup *cgrp);

/*
 * stat.c
 */
void cgroup_stat_flush(struct cgroup *cgrp);
int cgroup_stat_init(struct cgroup *cgrp);
void cgroup_stat_exit(struct cgroup *cgrp);
void cgroup_stat_boot(void);

/*
 * namespace.c
 */
24 changes: 22 additions & 2 deletions kernel/cgroup/cgroup.c
@@ -142,12 +142,14 @@ static struct static_key_true *cgroup_subsys_on_dfl_key[] = {
};
#undef SUBSYS

static DEFINE_PER_CPU(struct cgroup_cpu_stat, cgrp_dfl_root_cpu_stat);

/*
 * The default hierarchy, reserved for the subsystems that are otherwise
 * unattached - it never has more than a single cgroup, and all tasks are
 * part of that cgroup.
 */
-struct cgroup_root cgrp_dfl_root;
+struct cgroup_root cgrp_dfl_root = { .cgrp.cpu_stat = &cgrp_dfl_root_cpu_stat };
EXPORT_SYMBOL_GPL(cgrp_dfl_root);

/*
@@ -3301,6 +3303,8 @@ static int cgroup_stat_show(struct seq_file *seq, void *v)
	seq_printf(seq, "nr_dying_descendants %d\n",
		   cgroup->nr_dying_descendants);

	cgroup_stat_show_cputime(seq, "cpu.");

	return 0;
}
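
The call above passes "cpu." as a prefix, which matches the
`cpu.usage_usec`, `cpu.user_usec` and `cpu.system_usec` keys added to the
documentation.  A hedged user-space sketch of what such prefix-based
formatting could look like follows.  `format_cputime()` is a hypothetical
helper, and the assumption that the counters are kept in nanoseconds and
converted to microseconds on output is mine; the actual
cgroup_stat_show_cputime() lives in stat.c, which is not shown on this
page.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NSEC_PER_USEC UINT64_C(1000)

/*
 * Hypothetical helper: write the three cputime keys, each prefixed with
 * @prefix, into @buf.  Inputs are assumed to be in nanoseconds.
 * Returns what snprintf() returns.
 */
static int format_cputime(char *buf, size_t len, const char *prefix,
			  uint64_t usage_ns, uint64_t user_ns,
			  uint64_t system_ns)
{
	return snprintf(buf, len,
			"%susage_usec %" PRIu64 "\n"
			"%suser_usec %" PRIu64 "\n"
			"%ssystem_usec %" PRIu64 "\n",
			prefix, usage_ns / NSEC_PER_USEC,
			prefix, user_ns / NSEC_PER_USEC,
			prefix, system_ns / NSEC_PER_USEC);
}

/* Usage: returns 1 if the sample values format as expected. */
static int format_demo(void)
{
	char buf[128];

	format_cputime(buf, sizeof(buf), "cpu.", 1500000, 1000000, 500000);
	return strcmp(buf,
		      "cpu.usage_usec 1500\n"
		      "cpu.user_usec 1000\n"
		      "cpu.system_usec 500\n") == 0;
}
```

Taking the prefix as a parameter keeps the formatter independent of the
file it is shown in, so the same three keys could be emitted under a
different prefix without changing the helper.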

@@ -4471,6 +4475,8 @@ static void css_free_work_fn(struct work_struct *work)
			 */
			cgroup_put(cgroup_parent(cgrp));
			kernfs_put(cgrp->kn);
			if (cgroup_on_dfl(cgrp))
				cgroup_stat_exit(cgrp);
			kfree(cgrp);
		} else {
			/*
@@ -4515,6 +4521,9 @@ static void css_release_work_fn(struct work_struct *work)
		/* cgroup release path */
		trace_cgroup_release(cgrp);

		if (cgroup_on_dfl(cgrp))
			cgroup_stat_flush(cgrp);

		for (tcgrp = cgroup_parent(cgrp); tcgrp;
		     tcgrp = cgroup_parent(tcgrp))
			tcgrp->nr_dying_descendants--;
@@ -4698,14 +4707,20 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
	if (ret)
		goto out_free_cgrp;

	if (cgroup_on_dfl(parent)) {
		ret = cgroup_stat_init(cgrp);
		if (ret)
			goto out_cancel_ref;
	}

	/*
	 * Temporarily set the pointer to NULL, so idr_find() won't return
	 * a half-baked cgroup.
	 */
	cgrp->id = cgroup_idr_alloc(&root->cgroup_idr, NULL, 2, 0, GFP_KERNEL);
	if (cgrp->id < 0) {
		ret = -ENOMEM;
-		goto out_cancel_ref;
+		goto out_stat_exit;
	}

init_cgroup_housekeeping(cgrp);
@@ -4754,6 +4769,9 @@ static struct cgroup *cgroup_create(struct cgroup *parent)

	return cgrp;

out_stat_exit:
	if (cgroup_on_dfl(parent))
		cgroup_stat_exit(cgrp);
out_cancel_ref:
	percpu_ref_exit(&cgrp->self.refcnt);
out_free_cgrp:
@@ -5148,6 +5166,8 @@ int __init cgroup_init(void)
	BUG_ON(cgroup_init_cftypes(NULL, cgroup_base_files));
	BUG_ON(cgroup_init_cftypes(NULL, cgroup1_base_files));

	cgroup_stat_boot();

	/*
	 * The latency of the synchronize_sched() is too high for cgroups,
	 * avoid it at the cost of forcing all readers into the slow path.