# Performance Counters

CFiddle provides easy access to hardware performance counters, which count events such as cache misses and branch mispredictions.

<div class="alert alert-block alert-warning">

In order for performance counters to work, you need access to your hardware's performance counters. You can check the [perf_event_open man page](https://man7.org/linux/man-pages/man2/perf_event_open.2.html) for details about how to enable the `perf_events` interface on your system (it's usually turned on by default).

If you're in Docker, you'll also need to start the container with `--privileged`.
    
</div>

Let's use them to investigate the performance difference between `std::unordered_set` and `std::set` in the C++ STL.

## The Code

The code provides two functions that each fill a set with random integers.  We'll compile it with full optimizations.

In [None]:
%xmode Minimal
from cfiddle import *
from cfiddle.perfcount import *


In [None]:
exe = build(code(r"""
#include<cstdint>
#include<set>
#include<unordered_set>
#include"cfiddle.hpp"

extern "C"
int build_set(int count) {
    std::set<uint64_t> s;
    uint64_t seed = 0xDEADBEEF;
    start_measurement();
    for(int i= 0; i < count; i++) {
        s.insert(fast_rand(&seed));
    }
    end_measurement();
    return s.size();
}

extern "C"
int build_unordered_set(int count) {
    std::unordered_set<uint64_t> s;
    uint64_t seed = 0xDEADBEEF;
    start_measurement();
    for(int i= 0; i < count; i++) {
        s.insert(fast_rand(&seed));
    }
    end_measurement();
    return s.size();
}

"""), arg_map(OPTIMIZE="-O3"))

## Measuring Cache Misses and Instructions Executed

Here's the command to run the program and measure performance counters:

In [None]:
results = run(exe, 
              ["build_set", "build_unordered_set"], 
              arg_map(count=exp_range(8,16*1024*1024,2)), 
              perf_counters=["PERF_COUNT_HW_INSTRUCTIONS", "PERF_COUNT_HW_CACHE_L1D:READ:MISS"])

The key is the `perf_counters` parameter, which takes a list of performance counters to measure. CFiddle supports all the hardware, software, and cache counters described in the [perf_event_open() man page](https://man7.org/linux/man-pages/man2/perf_event_open.2.html). In this case, we'll count the number of level-1 data cache (`L1D`) read misses and the total number of instructions executed.
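To make the three-part cache counter name less opaque, here is a minimal sketch of how a name like `PERF_COUNT_HW_CACHE_L1D:READ:MISS` corresponds to the `config` value described in the man page (cache id, operation shifted left by 8 bits, result shifted left by 16 bits). The helper `encode_cache_event` is hypothetical and written only for illustration; it is not CFiddle's parser, and only a few of the constants from the man page are listed.

```python
# Numeric values as documented for perf_event_open(2); subset only.
CACHE_IDS = {
    "PERF_COUNT_HW_CACHE_L1D": 0,
    "PERF_COUNT_HW_CACHE_L1I": 1,
    "PERF_COUNT_HW_CACHE_LL": 2,
}
CACHE_OPS = {"READ": 0, "WRITE": 1, "PREFETCH": 2}
CACHE_RESULTS = {"ACCESS": 0, "MISS": 1}

def encode_cache_event(name):
    """Hypothetical helper: turn 'CACHE:OP:RESULT' into a perf config value."""
    cache, op, result = name.split(":")
    return CACHE_IDS[cache] | (CACHE_OPS[op] << 8) | (CACHE_RESULTS[result] << 16)

print(hex(encode_cache_event("PERF_COUNT_HW_CACHE_L1D:READ:MISS")))
```

The same pattern covers other combinations, for example `PERF_COUNT_HW_CACHE_LL:READ:MISS` for last-level cache read misses, provided your hardware exposes that event.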

We use Pandas data frame operations to compute some derived metrics:

In [None]:
r = results.as_df()
r['L1_MissPerInsert'] = r["PERF_COUNT_HW_CACHE_L1D:READ:MISS"]/r["count"]
r['InstPerInsert'] = r["PERF_COUNT_HW_INSTRUCTIONS"]/r["count"]
r['L1_MPI'] = r["PERF_COUNT_HW_CACHE_L1D:READ:MISS"]/r["PERF_COUNT_HW_INSTRUCTIONS"]
display(r)
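If you want to check the arithmetic without running the benchmark, the same three metrics can be computed on a tiny hand-made data frame. The numbers below are made up for illustration and are not measurements; only the column names match the cell above.

```python
import pandas as pd

# Made-up counter values, for illustration only (not measured results).
toy = pd.DataFrame({
    "function": ["build_set", "build_unordered_set"],
    "count": [1000, 1000],
    "PERF_COUNT_HW_INSTRUCTIONS": [200000, 100000],
    "PERF_COUNT_HW_CACHE_L1D:READ:MISS": [5000, 2000],
})

# Misses and instructions normalized by the number of inserts.
toy["L1_MissPerInsert"] = toy["PERF_COUNT_HW_CACHE_L1D:READ:MISS"] / toy["count"]
toy["InstPerInsert"] = toy["PERF_COUNT_HW_INSTRUCTIONS"] / toy["count"]
# Misses per instruction (MPI).
toy["L1_MPI"] = toy["PERF_COUNT_HW_CACHE_L1D:READ:MISS"] / toy["PERF_COUNT_HW_INSTRUCTIONS"]

print(toy[["function", "L1_MissPerInsert", "InstPerInsert", "L1_MPI"]])
```

Normalizing by `count` makes runs of different sizes comparable, while `L1_MPI` shows how often an instruction misses regardless of how many inserts were performed.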

And then we can reshuffle that data to make comparisons and plotting easier:

In [None]:
import pandas as pd
pt = pd.pivot_table(r, index="count", values=["InstPerInsert", "L1_MissPerInsert", "ET"], columns="function")
display(pt)
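The pivot produces a two-level column index, which is why the plotting cells select columns with tuples such as `("ET", "build_set")`. A small sketch with made-up values (not measurements) shows the resulting shape:

```python
import pandas as pd

# Made-up values, for illustration only (not measured results).
toy = pd.DataFrame({
    "function": ["build_set", "build_unordered_set"] * 2,
    "count": [8, 8, 16, 16],
    "ET": [1.0, 0.5, 2.0, 1.0],
    "InstPerInsert": [100.0, 80.0, 110.0, 82.0],
})

# One row per count; one column per (metric, function) pair.
toy_pt = pd.pivot_table(toy, index="count",
                        values=["ET", "InstPerInsert"],
                        columns="function")

print(list(toy_pt.columns))
print(toy_pt[("ET", "build_set")].loc[16])
```

Each column is addressed by a `(metric, function)` tuple, and each row by the value of `count`.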

## The Results

The results provide some insight into why `std::unordered_set` is roughly twice as fast as `std::set` for inserts: while the number of instructions per insert grows pretty slowly, the number of _cache misses_ per insert grows much faster for `std::set` than for `std::unordered_set`.

In [None]:
pt.plot.line(y=[("ET", "build_unordered_set"), ("ET", "build_set")], ylim=(0,10), ylabel="ET")

In [None]:
pt.plot.line(y=[("InstPerInsert", "build_unordered_set"), ("InstPerInsert", "build_set")], ylim=(0,520), ylabel="Instructions per Insert")

In [None]:
pt.plot.line(y=[("L1_MissPerInsert", "build_unordered_set"), ("L1_MissPerInsert", "build_set")], ylim=(0,50), ylabel="Misses per Insert")
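One plausible reading of the miss curves, offered as a back-of-envelope sketch rather than something measured here: `std::set` is typically implemented as a balanced binary tree, so an insert walks a root-to-leaf path whose length grows with the logarithm of the set size, and each node on that path may sit in a different cache line. The hypothetical helper below computes the fewest levels a binary tree needs to hold `n` nodes; a real balanced tree can be somewhat deeper.

```python
def min_tree_levels(n):
    """Fewest levels a binary tree needs to hold n nodes (n >= 1)."""
    # A tree with k full levels holds 2**k - 1 nodes, so n nodes
    # need floor(log2(n)) + 1 levels, which equals n.bit_length().
    return n.bit_length()

for n in [8, 1024, 1024 * 1024, 16 * 1024 * 1024]:
    print(n, min_tree_levels(n))
```

A hash table, by contrast, usually touches a roughly constant number of locations per insert, which would be consistent with the flatter curve for `build_unordered_set`.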