
resctrl


1 Resource Control

This document describes how to use L3 CAT and L2 CAT, and presents an overview of the resctrl filesystem’s structure. For more detailed information regarding the resctrl interface, please read resctrl_ui.txt in the Linux kernel documentation.

1.1 Resctrl

Resource Control (resctrl) is a kernel interface for CPU resource allocation using Intel(R) Resource Director Technology. The resctrl interface is available in kernels 4.10 and newer. Currently, Resource Control supports L2 CAT, L3 CAT and CDP, which allow partitioning of the L2 and L3 caches on a per-core or per-task basis. It also supports MBA; the maximum bandwidth can be specified as a percentage or in megabytes per second (with the optional mba_MBps mount flag).

If you wish to use this feature on a kernel older than 4.10, you can download a development kernel from the x86/cache branch at kernel.googlesource.com.

Make sure you have set the appropriate kernel flag; this is done in the kernel .config file:

# Intel Resource Director Technology Allocation support (INTEL_RDT_A) [N/y/?] (NEW) y
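To confirm that the CPU and running kernel expose the relevant features, you can look for the RDT feature flags in /proc/cpuinfo (which flags appear depends on your hardware; the output below is only illustrative):

# grep -woE 'rdt_a|cat_l3|cdp_l3|cat_l2|mba' /proc/cpuinfo | sort -u
cat_l3
cdp_l3
rdt_a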

1.1.1 Resctrl structure

Resource control is a mountable virtual file system located at /sys/fs/resctrl. To mount this system use the command:

# mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl

When selected, the cdp and cdpl2 mount options enable code/data prioritization in L3 and L2 cache allocations, respectively. The mba_MBps mount option enables the MBA software controller, in which case MBA maximum rates are specified in megabytes per second rather than as a percentage; if this option is not used, MBA rates are expressed as percentages. Once mounted, the default resource group and an info directory become visible.
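For example, to mount with both L3 CDP and the MBA software controller enabled (assuming hardware and kernel support for both), and then check the options that took effect, the output will look something like this:

# mount -t resctrl resctrl -o cdp,mba_MBps /sys/fs/resctrl
# mount | grep resctrl
resctrl on /sys/fs/resctrl type resctrl (rw,relatime,cdp,mba_MBps)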

Info Directory

The info directory contains hardware-specific information on a per cache level basis. For example, all information on the L3 cache is located in info/L3. It contains the following files:

“bit_usage”:

  • Annotated capacity bitmasks showing how all instances of the resource are used (e.g. “0=XXSSSSSSSSS;1=XXSSSSSSSSS”).

“num_closids”:

  • The number of unique COS configurations available for this resource (L3/L2/CDP). The kernel uses the smallest number of COS across all enabled resources as the limit (e.g. 8).

“cbm_mask”:

  • The max bit mask available for this resource (e.g. “7ff”).

“min_cbm_bits”:

  • The minimum number of consecutive bits that must be set for this resource (e.g. 1).

“shareable_bits”:

  • Bitmask of shareable resource with other executing entities (e.g. 600).

For monitoring features like L3 monitoring, information is stored in a relevant directory (info/L3_MON in this case). It contains the following files:

“max_threshold_occupancy”:

  • The largest value in bytes at which a previously used LLC_occupancy counter can be considered for reuse.

“mon_features”:

  • The list of monitoring events if monitoring is enabled (e.g. “llc_occupancy”).

“num_rmids”:

  • The number of available RMIDs (e.g. 224).

Information on memory bandwidth is placed in the info/MB directory. It contains the following files:

“bandwidth_gran”:

  • The granularity of the MBA configuration (e.g. 10).

“delay_linear”:

  • Indicates if the delay scale is linear or non-linear (1 for linear, 0 for non-linear).

“min_bandwidth”:

  • The minimum value for MBA configuration (e.g. 10).

“num_closids”:

  • The number of unique COS configurations available for MBA (e.g. 8).
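All of these values can be dumped in one pass with grep, which prints each file name alongside its contents. For example, for the L3 and MB resources (the values shown are the examples used in this section and will differ per platform):

# cd /sys/fs/resctrl/info
# grep -r . L3/ MB/
L3/bit_usage:0=XXSSSSSSSSS;1=XXSSSSSSSSS
L3/cbm_mask:7ff
L3/min_cbm_bits:1
L3/num_closids:8
L3/shareable_bits:600
MB/bandwidth_gran:10
MB/delay_linear:1
MB/min_bandwidth:10
MB/num_closids:8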

Resource groups

Resource groups are represented as directories in the resctrl file system. The default group is the root directory. Other groups may be created as desired by the system administrator using the "mkdir(1)" command, and removed using "rmdir(1)". There are three files associated with each group:

"tasks":

A list of tasks that belong to this group. Tasks can be added to a group by writing the task ID (PID) to the "tasks" file (which will automatically remove them from the previous group to which they belonged). New tasks created by fork(2) and clone(2) are added to the same group as their parent. If a PID is not in any sub-partition, it is in the root partition (i.e. the default group).

"cpus":

A bitmask of logical CPUs assigned to this group. Writing a new mask can add/remove CPUs to/from this group. Added CPUs are removed from their previous group. Removed ones are given to the default (root) group. You cannot remove CPUs from the default group.

"schemata":

A list of all the resources available to this group. Each resource has its own line and format. The format consists of a mask that controls access to the resource. For example, the schemata for the L3 cache will have a mask representing the available cache ways.
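As a quick illustration of all three files together (the group name p0, the PID and the 8-way L3 mask are hypothetical):

# mkdir /sys/fs/resctrl/p0
# echo 1234 > /sys/fs/resctrl/p0/tasks          # move task 1234 into p0
# echo 3 > /sys/fs/resctrl/p0/cpus              # assign CPUs 0 and 1 to p0
# echo "L3:0=ff" > /sys/fs/resctrl/p0/schemata  # limit p0 to 8 cache ways on cache ID 0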

Keeping in mind that cache can be allocated on a per-task or per-CPU basis, this is how the available resources are selected when a task is running:

  1. If the task is a member of a non-default group, then the schemata for that group is used.

  2. Else if the task belongs to the default group, but is running on a CPU that is assigned to some specific group, then the schemata for the CPU's group is used.

  3. Otherwise the schemata for the default group is used.

For example, if the task is assigned to COS1 and the CPU is assigned to COS2, then the schemata for COS1 will be used. However, if the task is not assigned to any COS but is running on a CPU assigned to a specific COS, the schemata for that COS will be used.

Figure 1 shows the structure of resctrl.

[Figure 1: resctrl structure]

1.1.2 How to use resctrl

1.1.2.1 Overview

As explained above, new resource groups are created using mkdir. Once a new resource group is created, the cpus, tasks and schemata files are automatically generated. The first free non-default Class of Service (COS) available is then associated with this new resource group. For example, the default resource group is associated with COS 0; the first new resource group created will be given COS 1, and so on.

If a resource group is removed (rmdir), then the COS it was associated with is freed up to be used again.

CPU file

As explained earlier, the cpus file contains a mask of the CPUs associated with this resource group. To edit this association, simply echo the new mask into the cpus file. For example, if your system has four logical cores, the mask for all four will be 1111 in binary, or f in hex. To associate two cores with COS1, type the following commands:

# mount -t resctrl resctrl [-o cdp] /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir COS1
# echo "3" > COS1/cpus

This will remove cores 0 and 1 from the default group (COS0) and associate them with COS1, so the files will look like the following:

# cat cpus
	c
# cat COS1/cpus
	3

Task file

The default tasks file contains the PIDs of all tasks running on the system. Like the cpus file above, if you move a task to a different resource group it will be removed from the default resource group's tasks file.
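For example, to move a (hypothetical) task with PID 1234 into COS1 and confirm the move:

# echo 1234 > COS1/tasks
# grep -w 1234 COS1/tasks
1234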

Schemata file

Each line in the file describes one resource. The line starts with the name of the resource, followed by specific values to be applied in each of the instances of that resource on the system.

Cache IDs

On current generation systems there is one L3 cache per socket, and L2 caches are generally shared only by the hyperthreads on a core, but this isn't an architectural requirement: there could be multiple separate L3 caches on a socket, or multiple cores could share an L2 cache. So instead of using "socket" or "core" to define the set of logical CPUs sharing a resource, we use a "cache ID". At a given cache level this is a unique number across the whole system (but it isn't guaranteed to be a contiguous sequence; there may be gaps). To find the ID for each logical CPU, look in:

/sys/devices/system/cpu/cpu*/cache/index*/id
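For example, to print the L3 cache ID for every logical CPU (on most systems index3 corresponds to L3; check the cache/index*/level file to be sure):

# grep . /sys/devices/system/cpu/cpu*/cache/index3/id
/sys/devices/system/cpu/cpu0/cache/index3/id:0
/sys/devices/system/cpu/cpu1/cache/index3/id:0
...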

Some examples:

L3 details (code and data prioritization disabled)

With CDP disabled, the L3 schemata format is:

L3:<cache_id0>=<mask>;<cache_id1>=<mask>;...

L3 details (CDP enabled via mount option to resctrl)

When CDP is enabled, the L3 control is split into two separate resources (L3DATA and L3CODE), so you can specify independent masks for code and data like this:

L3DATA:<cache_id0>=<mask>;<cache_id1>=<mask>;...
L3CODE:<cache_id0>=<mask>;<cache_id1>=<mask>;...
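For example, with an 11-bit CBM, a hypothetical group p0 could give code the top four ways and data the remaining seven ways of cache ID 0 (each resource line can be written on its own):

# echo "L3CODE:0=780;1=7ff" > p0/schemata
# echo "L3DATA:0=07f;1=7ff" > p0/schemata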

L2 details

With L2 code/data prioritization disabled (i.e. without the cdpl2 mount option), the L2 schemata format is:

L2:<cache_id0>=<mask>;<cache_id1>=<mask>;...
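For example, on a system with an 8-bit L2 CBM and two L2 cache instances, a hypothetical group could be limited to the lower half of each L2:

# echo "L2:0=0f;1=0f" > p0/schemata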

Figure 2 is a sequence diagram showing how to use resctrl.

[Figure 2: sequence diagram of resctrl usage]

Figure 3 is a block diagram showing how to use resctrl.

[Figure 3: creating a new resctrl resource group directory]

1.1.2.2 Examples

Example 1

On a two-socket machine (one L3 cache per socket) with just four bits for the cache bit masks:

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
# echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
# echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata

The default resource group is unmodified, so we have access to all parts of all caches (its schemata file reads "L3:0=f;1=f"). Tasks that are under the control of group "p0" may only allocate from the "lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1. Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Example 2

Again two sockets, but this time with a more realistic 20-bit mask. Two real-time tasks, pid=1234 running on processor 0 and pid=5678 running on processor 1, both on socket 0 of a two-socket, dual-core machine. To avoid noisy neighbors, each of the two real-time tasks exclusively occupies one quarter of the L3 cache on socket 0.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper" 50% of the L3 cache on socket 0 cannot be used by ordinary tasks:

# echo "L3:0=3ff;1=fffff" > schemata

Next we make a resource group for our first real time task and give it access to the "top" 25% of the cache on socket 0.

# mkdir p0
# echo "L3:0=f8000;1=fffff" > p0/schemata

Finally we move our first real time task into this resource group. We also use taskset(1) to ensure the task always runs on a dedicated CPU on socket 0. Most uses of resource groups will also constrain which processors tasks run on.

# echo 1234 > p0/tasks
# taskset -cp 1 1234

And for the second real time task (with the remaining 25% of cache):

# mkdir p1
# echo "L3:0=7c00;1=fffff" > p1/schemata
# echo 5678 > p1/tasks
# taskset -cp 2 5678
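At this point the allocations can be verified by reading the schemata files back (the exact zero-padding of the masks may vary by kernel version):

# cat p0/schemata
	L3:0=f8000;1=fffff
# cat p1/schemata
	L3:0=07c00;1=fffff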

Example 3

A single-socket system which has real-time tasks running on cores 4-7 and a non real-time workload assigned to cores 0-3. The real-time tasks share text and data, so per-task association is not required, and due to interaction with the kernel it's desirable that the kernel on these cores shares L3 with the tasks.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper" 50% of the L3 cache on socket 0 cannot be used by ordinary tasks:

# echo "L3:0=3ff" > schemata

Next we make a resource group for our real time cores and give it access to the "top" 50% of the cache on socket 0.

# mkdir p0
# echo "L3:0=ffc00;" > p0/schemata

Finally we move cores 4-7 over to the new group and make sure that the kernel and the tasks running there get 50% of the cache:

# echo f0 > p0/cpus
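The move can be verified by reading both cpus files; on this 8-CPU system the default group keeps cores 0-3 (output formatting may differ slightly between kernels):

# cat p0/cpus
	f0
# cat cpus
	0f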

1.1.3 How to debug resctrl using pqos

pqos and resctrl program the same MSRs to perform CAT. This means that you can use pqos to review your resource allocation at a system level. pqos can also be used to provide useful information needed for resctrl. For example, if you wish to know the cache IDs for all cores you can use:

# pqos -s -V

[Figure: pqos -s -V output showing per-core cache IDs]

From the output above we can see that, for core 23, the L3 cache ID is 1 and the L2 cache ID is 13.

Using the same pqos command we can see the current configuration for all COSes:

[Figure: pqos -s -V output showing the COS configuration]

And finally we can also see the core-to-COS mapping using the same pqos -s -V command:

[Figure: pqos -s -V output showing the core-to-COS mapping]

Using this data we are able to get a system-wide view of what has been altered using resctrl.
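Note that recent versions of pqos can also operate through the resctrl interface itself rather than programming the MSRs directly; this is selected with the -I (OS interface) option:

# pqos -I -s -V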