
AddX86Support

Thomas Gruber edited this page Aug 8, 2022 · 2 revisions

Introduction

Adding new x86_64 architectures to LIKWID is more work than for ARM. The reason is that LIKWID provides multiple counter access backends for x86_64: direct access, daemon access and perf_event.

Add hardware topology information

First, LIKWID requires some IDs from the hardware to identify the platform. Although the hwloc library gathers most information, LIKWID additionally reads the following fields from /proc/cpuinfo:

  • cpu family: For Intel commonly 0x6, for AMD different cpu families per CPU generation.
  • model: Differentiate the model of the CPU
  • stepping: Vendor internal revision of CPU model
  • model name: Name of the CPU model
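On x86 Linux systems, /proc/cpuinfo prints cpu family, model and stepping as decimal numbers, so they have to be converted to hexadecimal. The following is a minimal sketch of how the fields could be extracted; it is not LIKWID code, and cpuinfo_field is a hypothetical helper name chosen for illustration:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of LIKWID): look up a numeric field such as
 * "cpu family" or "model" in /proc/cpuinfo-style text.
 * Returns 0 and stores the number in *value on success, -1 otherwise. */
static int cpuinfo_field(const char *text, const char *key, unsigned *value)
{
    size_t keylen = strlen(key);
    const char *line = text;

    while (line != NULL && *line != '\0')
    {
        if (strncmp(line, key, keylen) == 0)
        {
            const char *p = line + keylen;
            /* Only blanks may separate the key and ':'. This keeps the key
             * "model" from matching the "model name" line. */
            while (*p == ' ' || *p == '\t')
            {
                p++;
            }
            if (*p == ':')
            {
                char *end = NULL;
                unsigned long v = strtoul(p + 1, &end, 10);
                if (end != p + 1)
                {
                    *value = (unsigned)v;
                    return 0;
                }
            }
        }
        /* advance to the start of the next line */
        line = strchr(line, '\n');
        if (line != NULL)
        {
            line++;
        }
    }
    return -1;
}
```

Printing the result with a format like "0x%X" gives the hexadecimal form; a model value of 61, for example, becomes 0x3D.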

When you have these IDs at hand (in hexadecimal format), you add them to src/includes/topology.h. Here is a snippet of this file:

#define  P6_FAMILY        0x6U
#define  ZEN_FAMILY       0x17

/* Intel P6 */
#define BROADWELL            0x3DU
#define SKYLAKE1             0x4EU

/* AMD */
#define ZEN2_RYZEN      0x31
[...]

The IDs with FAMILY correspond to cpu family and the Intel P6 and AMD sections to model. Please use reasonable abbreviations.

With this information, the chip can be identified, but LIKWID adds some more data for the user: a description of the chip architecture and a short name. The description is only used for output, but the short name is used later in the performance monitoring part. Both have to be added to src/topology.c. Snippet:

[...]
static char* broadwell_str = "Intel Core Broadwell processor";
static char* skylake_str = "Intel Skylake processor";
[...]
static char* amd_zen2_str = "AMD K17 (Zen2) architecture";
[...]
static char* short_broadwell = "broadwell";
static char* short_skylake = "skylake";
[...]
static char* short_zen2 = "zen2";

So add a nice string here and a short name. If the vendor publishes a short name for the chip, please use it. Intel provides long and short names like Intel Cascadelake X and CLX. I failed at using them, so please, be better than me and use official names ;)

The file src/topology.c contains a function topology_setName() which contains a set of switch-case statements based on the IDs we have added to src/includes/topology.h before. Search for P6 (Intel) or ZEN_FAMILY because the function is quite long. There you add the description and short name for the new chip. Here is a snippet of it:

switch ( cpuid_info.family )
{
    case P6_FAMILY:
        switch (cpuid_info.model)
        {
            case BROADWELL:
                cpuid_info.name = broadwell_str;
                cpuid_info.short_name = short_broadwell;
                break;
            [...]
        }
        break;
    [...]
}

With these settings, you should be able to run likwid-topology and get proper output (except cache information). The cache and NUMA information is provided by hwloc and should "just work" for x86_64 systems.
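The nested switch-case lookup described above can be tried in isolation. The following is a self-contained sketch under stated assumptions: CpuInfoSketch and set_name_sketch are hypothetical stand-ins for LIKWID's cpuid_info and topology_setName(), and the defines only repeat the values from the topology.h snippet.

```c
#include <stddef.h>

/* Values repeated from the src/includes/topology.h snippet */
#define P6_FAMILY   0x6U
#define ZEN_FAMILY  0x17
#define BROADWELL   0x3DU
#define ZEN2_RYZEN  0x31

/* Simplified, hypothetical stand-in for LIKWID's cpuid_info structure */
typedef struct {
    unsigned family;
    unsigned model;
    const char *name;
    const char *short_name;
} CpuInfoSketch;

static const char *broadwell_str = "Intel Core Broadwell processor";
static const char *short_broadwell = "broadwell";
static const char *amd_zen2_str = "AMD K17 (Zen2) architecture";
static const char *short_zen2 = "zen2";

/* Returns 0 if the family/model pair is known, -1 otherwise */
static int set_name_sketch(CpuInfoSketch *info)
{
    switch (info->family)
    {
        case P6_FAMILY:
            switch (info->model)
            {
                case BROADWELL:
                    info->name = broadwell_str;
                    info->short_name = short_broadwell;
                    return 0;
                default:
                    break;
            }
            break;
        case ZEN_FAMILY:
            switch (info->model)
            {
                case ZEN2_RYZEN:
                    info->name = amd_zen2_str;
                    info->short_name = short_zen2;
                    return 0;
                default:
                    break;
            }
            break;
        default:
            break;
    }
    /* unknown chip: leave no stale names behind */
    info->name = NULL;
    info->short_name = NULL;
    return -1;
}
```

The model switch is nested inside the family switch because model numbers are only meaningful within one cpu family.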

Adding support for performance monitoring

In order to get performance monitoring support for your chip, three new files are required:

  • perfmon_<name>.h: Main header for the chip
  • perfmon_<name>_counters.h: Counter/register definitions
  • perfmon_<name>_events.txt: Event definitions

Underscores or similar characters are not allowed in <name>, but please choose a reasonable name so the files can be found again later. If there are multiple versions of the chip with different counters and/or events (like Intel Broadwell desktop class broadwell, single-socket server class broadwellD and multi-socket server class broadwellEP), please put the logic for all chips in the main header, e.g. perfmon_broadwell.h, and use separate files for counters and/or events.

Counter/register definitions

The first file should be perfmon_<name>_counters.h. It commonly consists of 3 to 4 tables: list of counters, list of units (set of counters using the same device) and a table which maps LIKWID types to the perf_event units and the device information. For the direct and daemon access method, the registers should contain proper MSR and PCI register offsets in the device. In case of two 32 bit registers as one 64 bit register, there are two COUNTER_REG slots. Best practice is to define register names in src/includes/registers.h and use the named registers here. The first entry in the following tables is the template, the second line an example:

  • List of counters:
#define NUM_COUNTERS_<UPPERCASE_NAME> X

static RegisterMap <name>_counter_map[NUM_COUNTERS_<UPPERCASE_NAME>] = {
    {COUNTERNAME, UNIQUE_ID, UNIT, CONFIG_REG, COUNTER_REG1, COUNTER_REG2, DEVICE_ID, OPTION_MASK},
    {"PMC0", PMC0, PMC, 0x0, 0x0, 0, 0, 0x0},
    // for perf_event only the COUNTERNAME, UNIQUE_ID and UNIT are of interest
    // for direct and daemon access, the register offsets are required
    //  if there is only a counter, like free-running counter, the CONFIG_REG is not required
};

The COUNTERNAME can be chosen freely but the naming should somehow reflect their use-case. The OPTION_MASK is used if a counter has some specific extensions like thresholds or feature-switches.
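For counters that are split over two 32 bit registers (the two COUNTER_REG slots mentioned above), the two halves have to be merged after reading. A minimal sketch of that step follows; combine_counter_halves is a hypothetical helper, and which of the two slots holds the upper half depends on the hardware, so it is passed explicitly here:

```c
#include <stdint.h>

/* Hypothetical helper (not part of LIKWID): merge the two 32 bit halves of a
 * split counter into one 64 bit value. Which COUNTER_REG slot provides the
 * upper half is hardware specific, so the caller decides. */
static uint64_t combine_counter_halves(uint32_t upper, uint32_t lower)
{
    return ((uint64_t)upper << 32) | (uint64_t)lower;
}
```

The cast to uint64_t before the shift matters: shifting a 32 bit value by 32 bits is undefined in C.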

  • List of units:
static BoxMap <name>_box_map[NUM_UNITS] = {
    [UNITNAME] = {CONTROL_REG, STATUS_REG, CLEAR_REG, STATUS_REG_OFFSET, IS_PCI, DEVICE_ID, COUNTER_WIDTH},
    [PMC] = {0, 0, 0, 0, 0, 0, 48},
    // for perf_event only the COUNTER_WIDTH is of interest
    // for direct and daemon access, the register offsets are required
};

The UNITNAMEs are defined in src/includes/register_types.h. Adding new types is not recommended.

  • Translation map:
static char* <name>_translate_types[NUM_UNITS] = {
    [UNITNAME] = "path_to_perf_event_directory_containing_the_'type'_file_and_'format'_folder", 
    [PMC] = "/sys/bus/event_source/devices/cpu",
};

There is a default_translate_types (src/perfmon.c) list with basic settings. The list here is only required if the types differ from the default.

  • Device list: This list defines the access devices and is therefore only required for the direct and daemon access methods. The device names in [] are listed in src/includes/pci_types.h.
static PciDevice <name>_pci_devices[MAX_NUM_PCI_DEVICES] = {
 [MSR_DEV] = {NODEVTYPE, "", "MSR", ""}, // line should be always present

 [PCI_HA_DEVICE_0] = {HA, "12.1", NULL, NULL, 0x2f30},
 [DEVICE_NAME] = {DEVICE_TYPE, "DEVICE_FILENAME", NULL, NULL, PCI_DEVICE_ID},
};

LIKWID tries to find the devices using the DEVICE_FILENAME (like /proc/bus/pci/7f/12.1) and the PCI_DEVICE_ID (like /sys/bus/pci/devices/0000\:7f\:12.1/device -> 0x2f30). There is commonly one PCI bus per socket (like 0x7f in the two example paths).
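To make the relation between the socket's PCI bus, the DEVICE_FILENAME and the two example paths explicit, here is a small sketch that builds both paths. The helper names are hypothetical and the PCI domain 0000 is an assumption taken from the example path; this is not how LIKWID itself assembles the paths.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: path of the device file in /proc for a given bus and
 * DEVICE_FILENAME (e.g. bus 0x7f and "12.1").
 * Returns 0 on success, -1 on invalid input or too small buffer. */
static int pci_proc_path(char *buf, size_t len, unsigned bus, const char *devfile)
{
    if (devfile == NULL || strlen(devfile) == 0)
    {
        return -1;
    }
    int n = snprintf(buf, len, "/proc/bus/pci/%02x/%s", bus, devfile);
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}

/* Hypothetical helper: path of the sysfs file holding the PCI device ID.
 * Assumption: PCI domain 0000, as in the example path. */
static int pci_sysfs_id_path(char *buf, size_t len, unsigned bus, const char *devfile)
{
    if (devfile == NULL || strlen(devfile) == 0)
    {
        return -1;
    }
    int n = snprintf(buf, len, "/sys/bus/pci/devices/0000:%02x:%s/device", bus, devfile);
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}
```

The backslashes in the example path above are shell escapes; inside a C string the colons are written plainly.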

Event definitions

The most tedious work when adding a new chip is typing down/copying/parsing the list of supported events. The list of events is a plain text file that is transformed into a header during compilation.

The format for the events is fixed:

EVENT_<EVENTNAME> <EVENT_ID> <USABLE_COUNTERS>
UMASK_<EVENTNAME_SUBEVENT1> <UMASK>
UMASK_<EVENTNAME_SUBEVENT2> <UMASK> <CFGBITS> <THRES>

The <CFGBITS> and <THRES> are not required but can be used to enhance the counter options.

An example for an Intel event LD_BLOCKS_STORE_FORWARD and LD_BLOCKS_NO_SR:

EVENT_LD_BLOCKS                 0x03  PMC
UMASK_LD_BLOCKS_STORE_FORWARD   0x02
UMASK_LD_BLOCKS_NO_SR           0x08

The <USABLE_COUNTERS> string is compared to the counter names and only the beginning has to match, so PMC matches PMC0, PMC1, ... The result therefore depends on how you named the counters in the list of counters in perfmon_<name>_counters.h.
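The prefix rule can be sketched in a few lines. This is only an illustration of the matching described above, with a hypothetical function name, and not LIKWID's actual implementation:

```c
#include <string.h>

/* Hypothetical helper: returns 1 if counter_name begins with the
 * <USABLE_COUNTERS> string of an event, 0 otherwise. */
static int counter_usable(const char *usable, const char *counter_name)
{
    size_t n = strlen(usable);
    /* an empty string would match every counter, so reject it */
    return n > 0 && strncmp(counter_name, usable, n) == 0;
}
```

With this rule, PMC accepts PMC0 and PMC1, while a longer string like PMC0 only accepts counters whose names start with PMC0.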

Events with additional options

If an event provides an additional option that is not already specified in the OPTION_MASK in the counters definition, you can extend the option mask for the event like this:

EVENT_OFFCORE_RESPONSE_1                            0xBB PMC
OPTIONS_OFFCORE_RESPONSE_1_OPTIONS                  EVENT_OPTION_MATCH0_MASK
UMASK_OFFCORE_RESPONSE_1_OPTIONS                    0x01

For the event OFFCORE_RESPONSE_1_OPTIONS the user can use all options provided by the PMC counter and additionally the EVENT_OPTION_MATCH0 option. The OPTIONS_ line needs to be ahead of the UMASK_ line.

Events with default option values

Some events require certain options to be set to specific values in order to count properly. You can set default values for all options provided by the counter (and event):

EVENT_MACHINE_CLEARS                    0xC3  PMC
DEFAULT_OPTIONS_MACHINE_CLEARS_COUNT    EVENT_OPTION_THRESHOLD=0x01,EVENT_OPTION_EDGE=1
UMASK_MACHINE_CLEARS_COUNT              0x01
UMASK_MACHINE_CLEARS_CYCLES             0x01

The <UMASK> value for MACHINE_CLEARS_CYCLES and MACHINE_CLEARS_COUNT is the same, but in order to get the count, two default option values are required (EVENT_OPTION_THRESHOLD and EVENT_OPTION_EDGE). The accepted values are hexadecimal. This is similar to running:

$ likwid-perfctr -C 0 -g MACHINE_CLEARS_CYCLES:PMC0,MACHINE_CLEARS_CYCLES:PMC1:THRES=0x01:EDGEDETECT ...

If the event adds another option, the OPTIONS_ line must be before the DEFAULT_OPTIONS_ line.
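The value part of a DEFAULT_OPTIONS_ line is a comma-separated list of NAME=VALUE pairs with hexadecimal values. The sketch below shows one way such a string could be split; OptionSketch and parse_default_options are hypothetical names, and the real conversion happens in LIKWID's build step, which may work differently.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for one parsed NAME=VALUE pair */
typedef struct {
    char name[64];
    unsigned long long value;
} OptionSketch;

/* Hypothetical parser for "NAME=VALUE,NAME=VALUE". Values are read as
 * hexadecimal, with or without a leading 0x.
 * Returns the number of parsed options or -1 on malformed input. */
static int parse_default_options(const char *str, OptionSketch *opts, int max_opts)
{
    int count = 0;
    const char *p = str;

    while (*p != '\0')
    {
        const char *eq = strchr(p, '=');
        const char *end = strchr(p, ',');
        if (end == NULL)
        {
            end = p + strlen(p);
        }
        /* each pair needs a non-empty name followed by '=' */
        if (eq == NULL || eq > end || eq == p || count >= max_opts)
        {
            return -1;
        }
        size_t namelen = (size_t)(eq - p);
        if (namelen >= sizeof(opts[count].name))
        {
            return -1;
        }
        memcpy(opts[count].name, p, namelen);
        opts[count].name[namelen] = '\0';

        char *numend = NULL;
        opts[count].value = strtoull(eq + 1, &numend, 16);
        /* the number must be present and reach up to the separator */
        if (numend == eq + 1 || numend != end)
        {
            return -1;
        }
        count++;
        p = (*end == ',') ? end + 1 : end;
    }
    return count;
}
```

Applied to the string from the MACHINE_CLEARS example, this yields two options, both with the value 1.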

Main header file perfmon_<name>.h

For x86, the main header file can be quite large due to the different units. It contains the code to program the registers directly, to start them, to stop them with overflow checks and to read the counter registers. The file requires at least 6 functions for initialization, setup, activation, deactivation, reading and finalization of the support.

#include <topology.h>
#include <access.h>
#include <error.h>
#include <affinity.h>

#include <perfmon_<name>_events.h>
#include <perfmon_<name>_counters.h>

static int perfmon_numCounters<UPPERCASE_NAME> = NUM_COUNTERS_<UPPERCASE_NAME>;
static int perfmon_numArchEvents<UPPERCASE_NAME> = NUM_ARCH_EVENTS_<UPPERCASE_NAME>;

int perfmon_init_<name>(int cpu_id)
{
    // Acquire locks for the hardware thread cpu_id
    // Determine which hardware thread is responsible for a CPU core
    lock_acquire((int*) &tile_lock[affinity_thread2core_lookup[cpu_id]], cpu_id);
    // Determine which hardware thread is responsible for a CPU socket
    lock_acquire((int*) &socket_lock[affinity_thread2socket_lookup[cpu_id]], cpu_id);
    // There are different locks available. Before using them in the setup, start, stop, read or
    // finalize function, you have to acquire them here


    // Do other initialization work like setting some registers to zero or set function pointers
    // for later use.
    return 0;
}

int perfmon_setupCounterThread_<name>(int thread_id, PerfmonEventSet* eventSet)
{
    // thread_id is the offset in the CPUset, so get the cpu_id
    int cpu_id = groupSet->threads[thread_id].processorId;
    // Check whether cpu_id is responsible for the socket
    int haveLock = 0;
    if (socket_lock[affinity_thread2socket_lookup[cpu_id]] == cpu_id)
    {
        haveLock = 1;
    }

    for (int i=0;i < eventSet->numberOfEvents;i++)
    {
        // skip non-existing counters (device not available, ...)
        RegisterType type = eventSet->events[i].type;
        if (!TESTTYPE(eventSet, type))
        {
            continue;
        }
        // Index in counter_map (in perfmon_<name>_counters.h)
        RegisterIndex index = eventSet->events[i].index;
        // Event configuration (struct generated from perfmon_<name>_events.txt)
        PerfmonEvent *event = &(eventSet->events[i].event);
        // Access device
        PciDeviceIndex dev = counter_map[index].device;

        // configure event for counter at device
        // the current implementation uses a big switch-case here:
        switch (type)
        {
            case PMC:
                // configure event for PMC counter at MSR_DEV
                break;
            default:
                break;
        }

        // Mark the successful setup
        eventSet->events[i].threadCounter[thread_id].init = TRUE;
    }
    return 0;
}

int perfmon_startCountersThread_<name>(int thread_id, PerfmonEventSet* eventSet)
{
    // get cpu_id and lock status as in setup function

    for (int i=0;i < eventSet->numberOfEvents;i++)
    {
        if (eventSet->events[i].threadCounter[thread_id].init == TRUE)
        {
            // get type, index, event and dev as in setup function
            eventSet->events[i].threadCounter[thread_id].startData = 0;

            // start event for counter at device
            // if you cannot start/stop a counter, read the current value and store it in
            // eventSet->events[i].threadCounter[thread_id].startData

            eventSet->events[i].threadCounter[thread_id].counterData = eventSet->events[i].threadCounter[thread_id].startData;
        }
    }
    // commonly here you do something to start the units
    return 0;
}

int perfmon_stopCountersThread_<name>(int thread_id, PerfmonEventSet* eventSet)
{
    // get cpu_id and lock status as in setup function

    // commonly here you do something to stop the units and consequently also all counters

    for (int i=0;i < eventSet->numberOfEvents;i++)
    {
        if (eventSet->events[i].threadCounter[thread_id].init == TRUE)
        {
            // get type, index, event and dev as in setup function
            uint64_t raw_value = 0;
            // read the counter at dev into raw_value

            // store the value truncated to the counter width defined in perfmon_<name>_counters.h
            eventSet->events[i].threadCounter[thread_id].counterData = field64(raw_value, 0, box_map[type].regWidth);
        }
    }
    return 0;
}

int perfmon_readCountersThread_<name>(int thread_id, PerfmonEventSet* eventSet)
{
    // The read function is sometimes only a combination of the start and stop function but 
    // might be used to provide low overhead reads without pausing the measurements
    // The main difference is that the start/stop function commonly resets the counter register to
    // zero and with read, we want to keep it counting.

    // get cpu_id and lock status as in setup function

    // Maybe stop the counters and save current settings

    for (int i=0;i < eventSet->numberOfEvents;i++)
    {
        if (eventSet->events[i].threadCounter[thread_id].init == TRUE)
        {
            // get type, index, event and dev as in setup function
            uint64_t raw_value = 0;
            // read the counter at dev into raw_value

            // store the value truncated to the counter width defined in perfmon_<name>_counters.h
            eventSet->events[i].threadCounter[thread_id].counterData = field64(raw_value, 0, box_map[type].regWidth);
        }
    }

    // Maybe restart counters with previous settings

    return 0;
}

int perfmon_finalizeCountersThread_<name>(int thread_id, PerfmonEventSet* eventSet)
{
    // get cpu_id and lock status as in setup function
    
    for (int i=0;i < eventSet->numberOfEvents;i++)
    {
        // get type, index, event and dev as in setup function

        // reset config and counter register(s) to zero, revert all done changes.
    }
    return 0;
}
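The stop and read functions above truncate the raw register value to the counter width, and the introduction of this section mentions overflow checks. The following sketch illustrates both ideas with plain bit arithmetic. The function names are hypothetical; this is not the implementation of field64 or of LIKWID's overflow handling, and it assumes the counter wrapped around at most once between start and stop.

```c
#include <stdint.h>

/* Keep only the lowest `width` bits of a raw register value. */
static uint64_t truncate_to_width(uint64_t raw, unsigned width)
{
    if (width >= 64)
    {
        return raw;
    }
    return raw & ((UINT64_C(1) << width) - 1);
}

/* Difference between a stop and a start value for a counter that is
 * `width` bits wide. Assumption: at most one wrap-around happened. */
static uint64_t counter_delta(uint64_t start, uint64_t stop, unsigned width)
{
    start = truncate_to_width(start, width);
    stop = truncate_to_width(stop, width);
    if (stop >= start)
    {
        return stop - start;
    }
    /* wrapped: remaining distance to the maximum, plus the step to zero,
     * plus the new value */
    uint64_t max = truncate_to_width(UINT64_MAX, width);
    return (max - start) + stop + 1;
}
```

For a 48 bit counter, a start value two below the maximum and a stop value of 3 give a difference of 5.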

Registering chip in performance monitoring module

The performance module is defined in src/perfmon.c.

At first, add the main header: #include <perfmon_<name>.h>. The next step is comparable to the topology_setName() function: the function perfmon_init_maps() also contains a set of nested switch-case statements. Search for P6 or ZEN as the function is quite long. Here we register, for a CPU family and model, the lists/tables defined before in perfmon_<name>_counters.h and generated from perfmon_<name>_events.txt.

switch ( cpuid_info.family )
{
    [...]
    case P6_FAMILY:
        switch ( cpuid_info.model)
        {
            case BROADWELL:
                eventHash = broadwell_arch_events; // <name>_arch_events generated at compilation
                perfmon_numArchEvents = perfmon_numArchEventsBroadwell; // defined by you in perfmon_<name>.h
                perfmon_numCounters = perfmon_numCountersBroadwell; // defined by you in perfmon_<name>.h
                counter_map = broadwell_counter_map; // <name>_counter_map defined in perfmon_<name>_counters.h
                box_map = broadwell_box_map; // <name>_box_map defined in perfmon_<name>_counters.h
                translate_types = broadwell_translate_types; // <name>_translate_types defined in perfmon_<name>_counters.h
                break;
           [...]
        }
    [...]
}

Moreover, we need to register the functions from the main header to be called for the architecture. The function perfmon_init_funcs() is similar to the perfmon_init_maps() with a big switch-case statement.

switch ( cpuid_info.family )
{
    [...]
    case P6_FAMILY:
        switch ( cpuid_info.model)
        {
            // the functions work for different models. The lists in perfmon_init_maps() differ
            // but the logic is the same
            case BROADWELL:
            case BROADWELL_E:
            case BROADWELL_D:
            case BROADWELL_E3:
                // call power_init() for that architecture to initialize the energy module
                initialize_power = TRUE;
                // call thermal_init() for that architecture to initialize the thermal module
                initialize_thermal = TRUE;
                // register the six functions from src/includes/perfmon_<name>.h
                initThreadArch = perfmon_init_broadwell;
                perfmon_startCountersThread = perfmon_startCountersThread_broadwell;
                perfmon_stopCountersThread = perfmon_stopCountersThread_broadwell;
                perfmon_readCountersThread = perfmon_readCountersThread_broadwell;
                perfmon_setupCountersThread = perfmon_setupCounterThread_broadwell;
                perfmon_finalizeCountersThread = perfmon_finalizeCountersThread_broadwell;
                break;
        }
        break;
}

Registering chip in access daemon

The above steps should enable performance monitoring for perf_event and the direct access mode. The access daemon requires some more work because it is separate code, kept apart from the library to keep its code base small. The access daemon is in src/access-daemon/accessDaemon.c. Similar to the previous steps, you have to register the architecture and specify a function that does access checks for the commands received from the library.

Similar to the previous registrations, the setting is done based on the cpu family and model. Search for P6 to find the switch-case block.

switch (family)
{
    case P6_FAMILY:
        allowed = allowed_intel;

        if (model == BROADWELL)
        {
            allowed = allowed_broadwell;
        }
        break;
}

There are some flags that can be set as well like:

  • isPCIUncore: The access daemon tries to find and load the PCI devices. You should also set the allowed_pci(PciDeviceType type, uint32_t reg) function pointer to an appropriate access checker function.
  • isClientMem: The access daemon tries to load the desktop class memory controllers of Intel CPUs

For isPCIUncore, you need to specify which PCI devices should be used. There is a bigger if-elseif-else block and you can reuse the PCI devices list in perfmon_<name>_counters.h for that: pci_devices_daemon = broadwelld_pci_devices. Don't forget to include the counters header in the access daemon.

In the access checker functions, you can use the register names from src/includes/registers.h.
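An access checker decides whether the daemon may touch a requested register. The sketch below shows the general shape under stated assumptions: the signature (one register address in, 1 for allowed and 0 for denied out) and the number of counters are assumptions for illustration, and the SKETCH_ constants stand in for named registers from src/includes/registers.h. The two base addresses are the architectural Intel MSRs IA32_PMC0 and IA32_PERFEVTSEL0.

```c
#include <stdint.h>

/* Stand-ins for named registers; real code would use the names defined in
 * src/includes/registers.h */
#define SKETCH_PMC0        0x0C1U  /* IA32_PMC0 */
#define SKETCH_PERFEVTSEL0 0x186U  /* IA32_PERFEVTSEL0 */
#define SKETCH_NUM_PMC     4U      /* assumption: four general-purpose counters */

/* Hypothetical access checker: returns 1 if the register may be accessed,
 * 0 otherwise. Everything not explicitly listed is denied. */
static int allowed_sketch(uint32_t reg)
{
    if (reg >= SKETCH_PMC0 && reg < SKETCH_PMC0 + SKETCH_NUM_PMC)
    {
        return 1;
    }
    if (reg >= SKETCH_PERFEVTSEL0 && reg < SKETCH_PERFEVTSEL0 + SKETCH_NUM_PMC)
    {
        return 1;
    }
    return 0;
}
```

Denying by default keeps the privileged daemon from touching registers that the library has no reason to request.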

Further x86 related modules in LIKWID

LIKWID contains further modules that provide functionality for x86 systems. The power and thermal modules named above are also used for performance monitoring; other modules, like the prefetcher or CPU frequency manipulation, are not related to it.

Power module

Intel (since SandyBridge) and AMD (since Zen) support the so-called RAPL interface which provides energy consumption measurements. The vendors use the measurements internally for management reasons like hardware-guided power constraints etc. If the system is a successor of previous Intel or AMD systems, check src/power.c for switch-case blocks and add your cpu family/model combination there.

Thermal module

The thermal module is currently provided only for Intel architectures, although the feature should be available for AMD as well. You don't have to change anything in src/thermal.c, just set the initialize_thermal flag in src/perfmon.c:perfmon_init_funcs().

Finalization

Internal list of supported chip architectures

As a final step, add your chip to the print_supportedCPUs() function in src/topology.c

Add it to README.md and add comment in CHANGELOG

Add the new chip to README.md in section https://github.com/RRZE-HPC/likwid/blob/master/README.md#supported-architectures. Add a line in CHANGELOG like Support for VENDOR MODEL (LIST_OF_SUPPORTED_UNITS).

Create counter tables in wiki and Doxygen

The LIKWID wiki contains one page per supported architecture with tables of available counters, restrictions and further information. Unfortunately, I had to use HTML tables instead of Markdown tables. Copy one already existing ARM architecture file to get the structure and add all information. If you also used the HTML table syntax, you can copy the tables into the Doxygen documentation in doc/arch/.md. Use the same basic layout as the other architecture files.
