hugetlb: Add hugetlb.*.numa_stat file
For hugetlb backed jobs/VMs it's critical to understand the numa
information for the memory backing these jobs to deliver optimal
performance.

Currently this can technically be queried from /proc/self/numa_maps, but
there are significant issues with that. Namely:
1. Memory can be mapped or unmapped.
2. numa_maps are per process and need to be aggregated across all
   processes in the cgroup. For shared memory this is more involved as
   the userspace needs to make sure it doesn't double count shared
   mappings.
3. I believe querying numa_maps needs to hold the mmap_lock which adds
   to the contention on this lock.

For these reasons I propose simply adding a hugetlb.*.numa_stat file,
which shows the NUMA information of the cgroup similarly to
memory.numa_stat.

On cgroup-v2:
   cat /dev/cgroup/memory/test/hugetlb.2MB.numa_stat
   total=2097152 N0=2097152 N1=0

On cgroup-v1:
   cat /dev/cgroup/memory/test/hugetlb.2MB.numa_stat
   total=2097152 N0=2097152 N1=0
   hierarichal_total=2097152 N0=2097152 N1=0

This patch was tested manually by allocating hugetlb memory and querying
the hugetlb.*.numa_stat file of the cgroup and its parents.
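
For readers who want to consume the file from userspace, the line format
shown above ("total=<bytes> N<nid>=<bytes> ...") can be parsed with a few
lines of C. This is only a sketch under stated assumptions: the names
parse_numa_stat_line, struct numa_stat and MAX_NODES are hypothetical and
not part of this patch, and MAX_NODES is an arbitrary illustrative bound.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_NODES 64	/* assumption: illustrative bound, not a kernel constant */

/* Hypothetical container for one parsed numa_stat line. */
struct numa_stat {
	unsigned long long total;		/* value after "total=" */
	unsigned long long node[MAX_NODES];	/* value after "N<nid>=" */
	int nr_nodes;				/* highest node id seen, plus one */
};

/*
 * Parse one line such as "total=2097152 N0=2097152 N1=0".
 * Any prefix in front of "total=" (as on the cgroup-v1 second line) is
 * skipped. Returns 0 on success and -1 on malformed input.
 */
static int parse_numa_stat_line(const char *line, struct numa_stat *st)
{
	const char *p = strstr(line, "total=");
	char *end;

	if (!p)
		return -1;

	memset(st, 0, sizeof(*st));
	p += strlen("total=");
	st->total = strtoull(p, &end, 10);
	if (end == p)
		return -1;
	p = end;

	for (;;) {
		long nid;

		while (*p == ' ')
			p++;
		if (*p != 'N')
			break;	/* end of the per-node fields */
		p++;

		nid = strtol(p, &end, 10);
		if (end == p || *end != '=' || nid < 0 || nid >= MAX_NODES)
			return -1;
		p = end + 1;

		st->node[nid] = strtoull(p, &end, 10);
		if (end == p)
			return -1;
		if (nid + 1 > st->nr_nodes)
			st->nr_nodes = (int)nid + 1;
		p = end;
	}

	return 0;
}
```

A caller would read each line of hugetlb.<hugepagesize>.numa_stat with
fgets() and pass it to the parser; on cgroup-v1 the second line carries the
hierarchical values.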

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jue Wang <juew@google.com>
Cc: Yang Yao <ygyao@google.com>
Cc: Joanna Li <joannali@google.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org

Signed-off-by: Mina Almasry <almasrymina@google.com>
mina authored and intel-lab-lkp committed Oct 19, 2021
1 parent 66739f1 commit cdac71b
Showing 6 changed files with 113 additions and 11 deletions.
4 changes: 4 additions & 0 deletions Documentation/admin-guide/cgroup-v1/hugetlb.rst
@@ -29,12 +29,14 @@ Brief summary of control files::
hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
hugetlb.<hugepagesize>.usage_in_bytes # show current usage for "hugepagesize" hugetlb
hugetlb.<hugepagesize>.failcnt # show the number of allocation failure due to HugeTLB usage limit
hugetlb.<hugepagesize>.numa_stat # show the numa information of the hugetlb memory charged to this cgroup

For a system supporting three hugepage sizes (64k, 32M and 1G), the control
files include::

hugetlb.1GB.limit_in_bytes
hugetlb.1GB.max_usage_in_bytes
hugetlb.1GB.numa_stat
hugetlb.1GB.usage_in_bytes
hugetlb.1GB.failcnt
hugetlb.1GB.rsvd.limit_in_bytes
@@ -43,6 +45,7 @@ files include::
hugetlb.1GB.rsvd.failcnt
hugetlb.64KB.limit_in_bytes
hugetlb.64KB.max_usage_in_bytes
hugetlb.64KB.numa_stat
hugetlb.64KB.usage_in_bytes
hugetlb.64KB.failcnt
hugetlb.64KB.rsvd.limit_in_bytes
@@ -51,6 +54,7 @@ files include::
hugetlb.64KB.rsvd.failcnt
hugetlb.32MB.limit_in_bytes
hugetlb.32MB.max_usage_in_bytes
hugetlb.32MB.numa_stat
hugetlb.32MB.usage_in_bytes
hugetlb.32MB.failcnt
hugetlb.32MB.rsvd.limit_in_bytes
7 changes: 7 additions & 0 deletions Documentation/admin-guide/cgroup-v2.rst
@@ -2260,6 +2260,13 @@ HugeTLB Interface Files
are local to the cgroup i.e. not hierarchical. The file modified event
generated on this file reflects only the local events.

hugetlb.<hugepagesize>.numa_stat
Similar to memory.numa_stat, it shows the numa information of the
memory in this cgroup:

/dev/cgroup/memory/test # cat hugetlb.2MB.numa_stat
total=0 N0=0 N1=0

Misc
----

4 changes: 2 additions & 2 deletions include/linux/hugetlb.h
@@ -624,8 +624,8 @@ struct hstate {
#endif
#ifdef CONFIG_CGROUP_HUGETLB
/* cgroup control files */
-struct cftype cgroup_files_dfl[7];
-struct cftype cgroup_files_legacy[9];
+struct cftype cgroup_files_dfl[8];
+struct cftype cgroup_files_legacy[10];
#endif
char name[HSTATE_NAME_LEN];
};
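
Each array grows by one because every control file needs a slot and the slot
after the last file is zeroed as a terminator (the "NULL terminate the last
cft" step in mm/hugetlb_cgroup.c). The following userspace sketch illustrates
that sizing rule. It is not kernel code: struct demo_cftype, NAME_LEN and
both helper functions are hypothetical stand-ins.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define NAME_LEN 64	/* assumption: stand-in for MAX_CFTYPE_NAME */

/* Hypothetical stand-in for struct cftype; only the name matters here. */
struct demo_cftype {
	char name[NAME_LEN];
};

/* Fill nr named slots, then zero slot nr so it acts as the terminator. */
static void demo_init_files(struct demo_cftype *cfts,
			    const char *const *names, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		snprintf(cfts[i].name, NAME_LEN, "%s", names[i]);
	memset(&cfts[nr], 0, sizeof(cfts[nr]));
}

/* Count entries up to, but not including, the zeroed terminator. */
static size_t demo_count_files(const struct demo_cftype *cfts)
{
	size_t n = 0;

	while (cfts[n].name[0] != '\0')
		n++;
	return n;
}
```

With seven files an array of eight entries is required, which matches the
new size of cgroup_files_dfl above; an array of seven would leave no room
for the terminator.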
7 changes: 7 additions & 0 deletions include/linux/hugetlb_cgroup.h
@@ -36,6 +36,11 @@ enum hugetlb_memory_event {
HUGETLB_NR_MEMORY_EVENTS,
};

struct hugetlb_cgroup_per_node {
/* hugetlb usage in bytes over all hstates. */
unsigned long usage[HUGE_MAX_HSTATE];
};

struct hugetlb_cgroup {
struct cgroup_subsys_state css;

@@ -57,6 +62,8 @@ struct hugetlb_cgroup {

/* Handle for "hugetlb.events.local" */
struct cgroup_file events_local_file[HUGE_MAX_HSTATE];

struct hugetlb_cgroup_per_node *nodeinfo[];
};

static inline struct hugetlb_cgroup *
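
The nodeinfo[] member added above is a flexible array member, so the
allocation has to reserve room for the trailing pointers in addition to
sizeof(struct hugetlb_cgroup). A minimal userspace sketch of the same layout
follows, using calloc() in place of kzalloc(). All demo_* names and
DEMO_MAX_HSTATE are hypothetical, and the cleanup path is an addition of
this sketch rather than something shown in the diff.

```c
#include <stdlib.h>

#define DEMO_MAX_HSTATE 2	/* assumption: stand-in for HUGE_MAX_HSTATE */

/* Hypothetical userspace mirror of struct hugetlb_cgroup_per_node. */
struct demo_per_node {
	unsigned long usage[DEMO_MAX_HSTATE];
};

struct demo_cgroup {
	int placeholder;			/* stands in for the fixed-size fields */
	struct demo_per_node *nodeinfo[];	/* flexible array member, must be last */
};

/* Free every per-node block, then the cgroup itself. */
static void demo_cgroup_free(struct demo_cgroup *cg, size_t nr_nodes)
{
	size_t i;

	if (!cg)
		return;
	for (i = 0; i < nr_nodes; i++)
		free(cg->nodeinfo[i]);	/* free(NULL) is a no-op */
	free(cg);
}

/* Allocate the struct plus nr_nodes trailing pointers, all zeroed. */
static struct demo_cgroup *demo_cgroup_alloc(size_t nr_nodes)
{
	struct demo_cgroup *cg;
	size_t i;

	cg = calloc(1, sizeof(*cg) + nr_nodes * sizeof(cg->nodeinfo[0]));
	if (!cg)
		return NULL;

	for (i = 0; i < nr_nodes; i++) {
		cg->nodeinfo[i] = calloc(1, sizeof(*cg->nodeinfo[i]));
		if (!cg->nodeinfo[i]) {
			demo_cgroup_free(cg, nr_nodes);
			return NULL;
		}
	}
	return cg;
}
```

The size expression has the same shape as the one the patch uses in
hugetlb_cgroup_css_alloc(): the base struct plus one pointer per possible
node.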
93 changes: 86 additions & 7 deletions mm/hugetlb_cgroup.c
@@ -92,6 +92,7 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
struct hugetlb_cgroup *parent_h_cgroup)
{
int idx;
int node;

for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
struct page_counter *fault_parent = NULL;
@@ -124,6 +125,15 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
limit);
VM_BUG_ON(ret);
}

for_each_node(node) {
/* Set node_to_alloc to -1 for offline nodes. */
int node_to_alloc =
node_state(node, N_NORMAL_MEMORY) ? node : -1;
h_cgroup->nodeinfo[node] =
kzalloc_node(sizeof(struct hugetlb_cgroup_per_node),
GFP_KERNEL, node_to_alloc);
}
}

static struct cgroup_subsys_state *
@@ -132,7 +142,10 @@ hugetlb_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
struct hugetlb_cgroup *parent_h_cgroup = hugetlb_cgroup_from_css(parent_css);
struct hugetlb_cgroup *h_cgroup;

-h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL);
+unsigned int size =
+sizeof(*h_cgroup) +
+MAX_NUMNODES * sizeof(struct hugetlb_cgroup_per_node *);
+h_cgroup = kzalloc(size, GFP_KERNEL);
if (!h_cgroup)
return ERR_PTR(-ENOMEM);

@@ -292,7 +305,9 @@ static void __hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
return;

__set_hugetlb_cgroup(page, h_cg, rsvd);
-return;
+if (!rsvd && h_cg)
+h_cg->nodeinfo[page_to_nid(page)]->usage[idx] += nr_pages
+<< PAGE_SHIFT;
}

void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
@@ -331,7 +346,9 @@ static void __hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,

if (rsvd)
css_put(&h_cg->css);
-
+else
+h_cg->nodeinfo[page_to_nid(page)]->usage[idx] -= nr_pages
+<< PAGE_SHIFT;
return;
}

@@ -421,6 +438,56 @@ enum {
RES_RSVD_FAILCNT,
};

static int hugetlb_cgroup_read_numa_stat(struct seq_file *seq, void *dummy)
{
int nid;
struct cftype *cft = seq_cft(seq);
int idx = MEMFILE_IDX(cft->private);
bool legacy = MEMFILE_ATTR(cft->private);
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(seq_css(seq));
struct cgroup_subsys_state *css;
unsigned long usage;

if (legacy) {
/* Add up usage across all nodes for the non-hierarchical total. */
usage = 0;
for_each_node_state(nid, N_MEMORY)
usage += h_cg->nodeinfo[nid]->usage[idx];
seq_printf(seq, "total=%lu", usage);

/* Simply print the per-node usage for the non-hierarchical total. */
for_each_node_state(nid, N_MEMORY)
seq_printf(seq, " N%d=%lu", nid,
h_cg->nodeinfo[nid]->usage[idx]);
seq_putc(seq, '\n');
}

/* The hierarchical total is pretty much the value recorded by the
* counter, so use that.
*/
seq_printf(seq, "%stotal=%lu", legacy ? "hierarichal_" : "",
(u64)page_counter_read(&h_cg->hugepage[idx]) * PAGE_SIZE);

/* For each node, traverse the css tree to obtain the hierarchical
 * node usage.
 */
for_each_node_state(nid, N_MEMORY) {
usage = 0;
rcu_read_lock();
css_for_each_descendant_pre(css, &h_cg->css) {
usage += hugetlb_cgroup_from_css(css)
->nodeinfo[nid]
->usage[idx];
}
rcu_read_unlock();
seq_printf(seq, " N%d=%lu", nid, usage);
}

seq_putc(seq, '\n');

return 0;
}

static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
@@ -654,16 +721,22 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
cft->seq_show = hugetlb_cgroup_read_u64_max;
cft->flags = CFTYPE_NOT_ON_ROOT;

-/* Add the events file */
+/* Add the numa stat file */
cft = &h->cgroup_files_dfl[4];
+snprintf(cft->name, MAX_CFTYPE_NAME, "%s.numa_stat", buf);
+cft->seq_show = hugetlb_cgroup_read_numa_stat;
+cft->flags = CFTYPE_NOT_ON_ROOT;
+
+/* Add the events file */
+cft = &h->cgroup_files_dfl[5];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events", buf);
cft->private = MEMFILE_PRIVATE(idx, 0);
cft->seq_show = hugetlb_events_show;
cft->file_offset = offsetof(struct hugetlb_cgroup, events_file[idx]);
cft->flags = CFTYPE_NOT_ON_ROOT;

/* Add the events.local file */
-cft = &h->cgroup_files_dfl[5];
+cft = &h->cgroup_files_dfl[6];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events.local", buf);
cft->private = MEMFILE_PRIVATE(idx, 0);
cft->seq_show = hugetlb_events_local_show;
@@ -672,7 +745,7 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
cft->flags = CFTYPE_NOT_ON_ROOT;

/* NULL terminate the last cft */
-cft = &h->cgroup_files_dfl[6];
+cft = &h->cgroup_files_dfl[7];
memset(cft, 0, sizeof(*cft));

WARN_ON(cgroup_add_dfl_cftypes(&hugetlb_cgrp_subsys,
@@ -742,8 +815,14 @@ static void __init __hugetlb_cgroup_file_legacy_init(int idx)
cft->write = hugetlb_cgroup_reset;
cft->read_u64 = hugetlb_cgroup_read_u64;

/* Add the numa stat file */
cft = &h->cgroup_files_legacy[8];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.numa_stat", buf);
cft->private = MEMFILE_PRIVATE(idx, 1);
cft->seq_show = hugetlb_cgroup_read_numa_stat;

/* NULL terminate the last cft */
-cft = &h->cgroup_files_legacy[8];
+cft = &h->cgroup_files_legacy[9];
memset(cft, 0, sizeof(*cft));

WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
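
The accounting introduced in this file can be modelled in userspace to see
how the local and hierarchical lines relate. The sketch below is an
illustration under stated assumptions, not the kernel implementation: all
demo_* names are hypothetical, it tracks a single hstate on two nodes,
assumes 4 KiB base pages, and computes the hierarchical total by summing the
per-node values, whereas the patch reads that total from the page counter.

```c
#include <stddef.h>
#include <stdio.h>

#define DEMO_NODES	2	/* assumption: two NUMA nodes */
#define DEMO_PAGE_SHIFT	12	/* assumption: 4 KiB base pages */

/* Hypothetical cgroup node: child/sibling links plus per-node byte counts. */
struct demo_cg {
	struct demo_cg *child;		/* first child, or NULL */
	struct demo_cg *sibling;	/* next sibling, or NULL */
	unsigned long usage[DEMO_NODES];
};

/* Mirror of the commit-charge update: add nr_pages worth of bytes. */
static void demo_charge(struct demo_cg *cg, int nid, unsigned long nr_pages)
{
	cg->usage[nid] += nr_pages << DEMO_PAGE_SHIFT;
}

/* Mirror of the uncharge update: subtract nr_pages worth of bytes. */
static void demo_uncharge(struct demo_cg *cg, int nid, unsigned long nr_pages)
{
	cg->usage[nid] -= nr_pages << DEMO_PAGE_SHIFT;
}

/* Usage of cg and all of its descendants on one node. */
static unsigned long demo_hier_usage(const struct demo_cg *cg, int nid)
{
	unsigned long sum = cg->usage[nid];
	const struct demo_cg *c;

	for (c = cg->child; c; c = c->sibling)
		sum += demo_hier_usage(c, nid);
	return sum;
}

/* Format "total=<bytes> N0=<bytes> N1=<bytes>" into buf. */
static int demo_format_line(char *buf, size_t len,
			    const struct demo_cg *cg, int hierarchical)
{
	unsigned long per_node[DEMO_NODES];
	unsigned long total = 0;
	int nid, off;

	for (nid = 0; nid < DEMO_NODES; nid++) {
		per_node[nid] = hierarchical ? demo_hier_usage(cg, nid)
					     : cg->usage[nid];
		total += per_node[nid];
	}

	off = snprintf(buf, len, "total=%lu", total);
	for (nid = 0; nid < DEMO_NODES && off >= 0 && (size_t)off < len; nid++)
		off += snprintf(buf + off, len - (size_t)off, " N%d=%lu",
				nid, per_node[nid]);
	return off;
}
```

Charging 512 base pages (one 2 MB huge page) to a child on node 0 leaves the
parent's local line at zero while its hierarchical line reports 2097152
bytes on N0, which is the relationship the cgroup-v1 output in the commit
message shows.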
9 changes: 7 additions & 2 deletions tools/testing/selftests/vm/write_to_hugetlbfs.c
@@ -37,8 +37,8 @@ static int shmid;
static void exit_usage(void)
{
printf("Usage: %s -p <path to hugetlbfs file> -s <size to map> "
-"[-m <0=hugetlbfs | 1=mmap(MAP_HUGETLB)>] [-l] [-r] "
-"[-o] [-w] [-n]\n",
+"[-m <0=hugetlbfs | 1=mmap(MAP_HUGETLB)>] [-l(sleep)] [-r(private)] "
+"[-o(populate)] [-w(rite)] [-n(o-reserve)]\n",
self);
exit(EXIT_FAILURE);
}
@@ -161,6 +161,11 @@ int main(int argc, char **argv)
else
printf("RESERVE mapping.\n");

if (want_sleep)
printf("Sleeping\n");
else
printf("Not sleeping\n");

switch (method) {
case HUGETLBFS:
printf("Allocating using HUGETLBFS.\n");