# Homework4

A few notes before the actual report:
- `CMakeLists.txt` enables Google Test and the compiler sanitizers. I use this CMake setup **only during development**, not for evaluation.
- The unit tests (Google Test) are in `Tester.cpp`, `TestAggregation.cpp`, and `TestDictionaryCompression.cpp`. Build and run them with CMake.
- I did some basic C++ refactoring, but it is not complete. The code style still has rough edges, such as functions defined in header files.
- For evaluation I only use the given makefile. For the reasons above, I made a tiny change to the makefile to fit my code. The change is not performance-critical at all.
- I did not cover all the (annoying) edge cases of SIMD programming, such as loading from aligned memory addresses or falling back to scalar code for the remaining tail elements. I hard-coded these cases because some comments hint that we may make assumptions about the datasets. I have taken care of some edge cases, but I still believe my implementation cannot handle datasets of arbitrary size.

# Aggregation

We simulate the following SQL query in C++. Only the count matters, so there is no materialization and no data copying.

```sql
SELECT COUNT(*)
FROM R
WHERE R.a > 42
```

The dataset comes in two variants: `int8_t` and `int64_t`. For each variant we perform three kinds of counting: trivial counting, branch-free counting, and SIMD counting:
- `int8_t`
    - count8
    - count8BrFree
    - count8SIMD
- `int64_t`
    - count64
    - count64BrFree
    - count64SIMD
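
The three variants can be sketched roughly as follows for the `int8_t` case. This is a minimal sketch, not the submitted code: the `_sketch` names, the signatures, and the threshold parameter `x` are assumptions, the predicate follows the SQL above (`> 42`), and the SIMD path uses 128-bit SSE2 intrinsics with a scalar tail so that it compiles without extra flags on GCC or Clang.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Trivial version: one conditional branch per element.
inline uint64_t count8_sketch(const int8_t* in, std::size_t n, int8_t x) {
  assert(in != nullptr || n == 0);
  uint64_t count = 0;
  for (std::size_t i = 0; i < n; i++) {
    if (in[i] > x) count++;
  }
  return count;
}

// Branch-free version: the comparison result (0 or 1) is added unconditionally.
inline uint64_t count8BrFree_sketch(const int8_t* in, std::size_t n, int8_t x) {
  assert(in != nullptr || n == 0);
  uint64_t count = 0;
  for (std::size_t i = 0; i < n; i++) {
    count += static_cast<uint64_t>(in[i] > x);
  }
  return count;
}

// SIMD version: compare 16 bytes at once, then count the set mask bits.
inline uint64_t count8SIMD_sketch(const int8_t* in, std::size_t n, int8_t x) {
  assert(in != nullptr || n == 0);
  uint64_t count = 0;
  std::size_t i = 0;
#if defined(__SSE2__)
  const __m128i threshold = _mm_set1_epi8(static_cast<char>(x));
  for (; i + 16 <= n; i += 16) {
    // Unaligned load, so no alignment assumption about `in` is needed.
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + i));
    const __m128i gt = _mm_cmpgt_epi8(v, threshold);  // 0xFF where in[i] > x (signed)
    const unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(gt));  // 1 bit per lane
    count += static_cast<uint64_t>(__builtin_popcount(mask));  // GCC/Clang builtin
  }
#endif
  // Scalar tail for the remaining (fewer than 16) elements.
  for (; i < n; i++) {
    count += static_cast<uint64_t>(in[i] > x);
  }
  return count;
}
```

All three functions return the same count for the same input; they differ only in how the work per element is expressed.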

## 1. Compiler Auto-Vectorization with different optimization flags

GCC 9.2.0 for `count8`:
- `-O0`: Trivial code with a branch.
- `-O1` and `-O2`:
    - https://compiler.db.in.tum.de/z/tapZ2e and https://compiler.db.in.tum.de/z/S65gHh
    - The generated assembly shows that the code is optimized into a branch-free version. The two flags produce nearly the same code.
    - Zoom in: 
    ```asm
        cmp     BYTE PTR [rcx], dl
        setl    sil
        movzx   esi, sil
        add     eax, esi
    ```
    - Interpretation of the `Zoom in`: `setl` turns the result of `cmp` into 0 or 1, and the `add` is executed unconditionally. No jump depends on the comparison, so the code **just adds anyway**.
- `-O3`:
    - https://compiler.db.in.tum.de/z/wgntZh
    - The generated assembly is much more sophisticated and not easy to read. But it is easy to spot `xmm` registers, which means the code is vectorized (SIMD). GCC can assume that every target CPU has `xmm` registers and the corresponding instructions, because SSE2 is part of the x86-64 baseline.
- `-O3 -fno-tree-vectorize`
    - https://compiler.db.in.tum.de/z/Y0U5Au
    - `-fno-tree-vectorize` disables compiler auto-vectorization, so the generated assembly is readable again. Unsurprisingly, the code is branch-free. An additional improvement is instruction reordering: the loop counter increment `add     rdi, 1` is moved into the predicate evaluation part. As learned in the lecture, this can reduce **data hazards**, so a higher IPC is possible.
- `-O3 -march=skylake-avx512`
    - https://compiler.db.in.tum.de/z/TipY86
    - Given more information about the CPU's instruction set, the compiler generates code with `ymm` registers, which is a step further towards newer SIMD instructions. But still not with `zmm`.
 
I also checked Clang, but found nothing new: the results are not very different from GCC's, so I do not repeat the analysis for Clang. I also do not repeat it for `count64`.
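
The effect of the `-march` flag discussed above can be made visible from inside the program. The following minimal sketch relies on the predefined macros `__SSE2__`, `__AVX2__`, and `__AVX512F__` that GCC and Clang set according to the target flags. The function name is hypothetical. Note that it only reports which registers the compiler is *allowed* to use, not which width the auto-vectorizer actually picks. GCC additionally documents a `-mprefer-vector-width` option that influences that choice.

```cpp
// Widest vector register width (in bits) enabled at compile time.
// Derived from macros predefined by GCC and Clang for the chosen target.
constexpr int enabled_vector_bits() {
#if defined(__AVX512F__)
  return 512;  // zmm registers available
#elif defined(__AVX2__)
  return 256;  // ymm registers with integer support
#elif defined(__SSE2__)
  return 128;  // xmm registers (x86-64 baseline)
#else
  return 0;    // no x86 SIMD extension detected
#endif
}
```

Compiled with plain `-O3` on x86-64 this yields 128, and with `-O3 -march=skylake-avx512` it yields 512, even though the generated loop in the link above only uses `ymm`.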

<br>
<br>

---

## 2. Performance Comparison

- compile with `-O3`
- `static constexpr unsigned chunkSize = 32 * 1024;`
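
Selectivity in the tables below is the percentage of tuples that satisfy the predicate. The data generator of the homework framework is not shown in this report, so the following is only a sketch of how such a dataset could be produced. The function name, the default threshold, and the seed are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical generator: exactly n * selectivityPercent / 100 values
// satisfy `value > threshold`, placed at random positions.
inline std::vector<int8_t> make_dataset_sketch(std::size_t n,
                                               unsigned selectivityPercent,
                                               int8_t threshold = 42,
                                               uint32_t seed = 1) {
  assert(selectivityPercent <= 100);
  assert(threshold > INT8_MIN && threshold < INT8_MAX);
  const std::size_t matching = n * selectivityPercent / 100;
  // Start with non-matching values everywhere.
  std::vector<int8_t> data(n, static_cast<int8_t>(threshold - 1));
  // Overwrite the first `matching` entries with matching values.
  std::fill(data.begin(),
            data.begin() + static_cast<std::ptrdiff_t>(matching),
            static_cast<int8_t>(threshold + 1));
  // Shuffle so that matches are spread over the whole dataset.
  std::mt19937 rng(seed);
  std::shuffle(data.begin(), data.end(), rng);
  return data;
}
```

With a random placement like this, a branching implementation would see the most unpredictable branches around 50% selectivity, which is why the experiments sweep from 1 to 99.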

### `count8` with varying selectivity

|Selectivity (%) | Runtime (s) | Cycles | Instructions | L1-misses | LLC-misses | Branch-misses | Task-clock | IPC | GHz |
|---|---|---|---|---|---|---|---|---|---|
|   1|     0.01 | 37967780 |  75753929 | 298264 |    1186.00 |       4470.00 | 8460367.00| 2.00 | 4.49 |
|  10|     0.01 | 37944228 |  75759726 | 298640 |      44.00 |       4373.00 | 9296856.00| 2.00 | 4.08 |
|  50|     0.01 | 37929608 |  75753652 | 297388 |       2.00 |       4284.00 | 9293465.00| 2.00 | 4.08 |
|  90|     0.01 | 37929970 |  75753431 | 297536 |       2.00 |       4271.00 | 9292979.00| 2.00 | 4.08 |
|  99|     0.01 | 37929996 |  75757288 | 297449 |      13.00 |       4302.00 | 9293348.00| 2.00 | 4.08 | 

### `count8BrFree` with varying selectivity

|Selectivity (%) | Runtime (s) | Cycles | Instructions | L1-misses | LLC-misses | Branch-misses | Task-clock | IPC | GHz |
|---|---|---|---|---|---|---|---|---|---|
|  1 |     0.01 | 37937768.00 |  75754360.00 |  248891.00 |     141.00 |       4371.00 |  9286812.00 | 2.00 | 4.09 |
| 10 |     0.01 | 37936964.00 |  75756120.00 |  250062.00 |      31.00 |       4329.00 |  8551862.00 | 2.00 | 4.44 |
| 50 |     0.01 | 37947498.00 |  75757929.00 |  248265.00 |      29.00 |       4302.00 |  8683809.00 | 2.00 | 4.37 |
| 90 |     0.01 | 37933641.00 |  75754000.00 |  248694.00 |       7.00 |       4272.00 |  8570374.00 | 2.00 | 4.43 |
| 99 |     0.01 | 37936553.00 |  75751117.00 |  248884.00 |       2.00 |       4252.00 |  8570310.00 | 2.00 | 4.43 |  

### `count8SIMD` with varying selectivity

|Selectivity (%) | Runtime (s) | Cycles | Instructions | L1-misses | LLC-misses | Branch-misses | Task-clock | IPC | GHz |
|---|---|---|---|---|---|---|---|---|---|
|  1 |     0.00 | 5646344.00 |  19043315.00 |  256755.00 |      38.00 |       4212.00 |  1383754.00 | 3.37 | 4.08 |
| 10 |     0.00 | 5676919.00 |  19049035.00 |  256416.00 |      23.00 |       4242.00 |  1426374.00 | 3.36 | 3.98 |
| 50 |     0.00 | 5665713.00 |  19048964.00 |  258650.00 |       9.00 |       4226.00 |  1388184.00 | 3.36 | 4.08 |
| 90 |     0.00 | 5654067.00 |  19043315.00 |  256613.00 |       1.00 |       4199.00 |  1385808.00 | 3.37 | 4.08 |
| 99 |     0.00 | 5643771.00 |  19043315.00 |  256614.00 |       1.00 |       4196.00 |  1418537.00 | 3.37 | 3.98 |  

<br>

Within each table, the only metric that differs noticeably across selectivities is **LLC-misses**. I cannot fully explain the reason. I think it is non-trivial: selectivity is the only experiment variable we change, all other variables are kept the same, and the code compiled with `-O3` is already branch-free. I therefore built a 1 GiB dataset to see whether things change when the program runs longer.

### `count8` with varying selectivity on the 1 GiB dataset

|Selectivity (%) | Runtime (s) | Cycles | Instructions | L1-misses | LLC-misses | Branch-misses | Task-clock | IPC | GHz |
|---|---|---|---|---|---|---|---|---|---|
|           1|     0.07| 303026454.00| 605887954.00| 2561410.00|    1081.00|      33475.00| 67607083.00| 2.00| 4.48 | 
|          10|     0.07| 303079169.00| 605903523.00| 2510285.00|      64.00|      33459.00| 69049179.00| 2.00| 4.39 | 
|          50|     0.07| 303024145.00| 605904251.00| 2492105.00|      24.00|      33426.00| 69104967.00| 2.00| 4.38 | 
|          90|     0.07| 303041225.00| 605903970.00| 2517845.00|      45.00|      33400.00| 70093723.00| 2.00| 4.32 | 
|          99|     0.07| 302975832.00| 605898334.00| 2465491.00|      12.00|      33342.00| 67570932.00| 2.00| 4.48 | 

### `count8BrFree` with varying selectivity on the 1 GiB dataset

|Selectivity (%) | Runtime (s) | Cycles | Instructions | L1-misses | LLC-misses | Branch-misses | Task-clock | IPC | GHz |
|---|---|---|---|---|---|---|---|---|---|
|           1|     0.07| 303069403.00| 605904571.00| 2370574.00|     162.00|      33569.00| 70088179.00| 2.00| 4.32 | 
|          10|     0.07| 302953587.00| 605897569.00| 2370558.00|      43.00|      33391.00| 67589821.00| 2.00| 4.48 | 
|          50|     0.07| 303010251.00| 605898885.00| 2370657.00|      20.00|      33374.00| 68960633.00| 2.00| 4.39 | 
|          90|     0.07| 302940150.00| 605895134.00| 2370447.00|      11.00|      33297.00| 67517510.00| 2.00| 4.49 | 
|          99|     0.07| 302998359.00| 605895551.00| 2370466.00|      45.00|      33357.00| 70757769.00| 2.00| 4.28 |  

### `count8SIMD` with varying selectivity on the 1 GiB dataset

|Selectivity (%) | Runtime (s) | Cycles | Instructions | L1-misses | LLC-misses | Branch-misses | Task-clock | IPC | GHz |
|---|---|---|---|---|---|---|---|---|---|
|           1|     0.01| 44026985.00| 152235981.00| 2377856.00|      30.00|      32948.00| 9969621.00|   3.46| 4.42 | 
|          10|     0.01| 43959079.00| 152243450.00| 2435882.00|      19.00|      32972.00| 10797501.00|  3.46| 4.07 | 
|          50|     0.01| 43981193.00| 152243436.00| 2446258.00|       5.00|      32966.00| 10649242.00|  3.46| 4.13 | 
|          90|     0.01| 44287390.00| 152243371.00| 2377643.00|       3.00|      32956.00| 10860073.00|  3.44| 4.08 | 
|          99|     0.01| 44206670.00| 152236117.00| 2390787.00|       6.00|      32933.00| 10125350.00|  3.44| 4.37 |  


We observe the same pattern within each function's cases. The branch-misses are constant, and so is the IPC, which results in the same runtime across selectivities for each function. I still cannot explain why the LLC-misses (or the memory access pattern) differ when the selectivity changes. I want to say something about prefetching, but that still goes against my intuition. **I am not able to solve it, so I put it into the TODO & Question part.**
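
As a sanity check on the derived columns, IPC and GHz can be recomputed from the raw counters. The sketch below uses two rows copied from the 1 GiB tables above (selectivity 1). It assumes that the task-clock column is in nanoseconds, which is not stated anywhere in this report but is consistent with the printed GHz values. The struct and function names are hypothetical.

```cpp
#include <cassert>

// One row of the performance tables (only the raw counters needed here).
struct PerfRow {
  double cycles;
  double instructions;
  double taskClockNs;  // assumption: task-clock is reported in nanoseconds
};

// Instructions per cycle.
inline double ipc(const PerfRow& r) {
  assert(r.cycles > 0.0);
  return r.instructions / r.cycles;
}

// Average clock frequency in GHz (cycles per nanosecond).
inline double ghz(const PerfRow& r) {
  assert(r.taskClockNs > 0.0);
  return r.cycles / r.taskClockNs;
}

// Rows copied from the 1 GiB tables, selectivity 1.
const PerfRow kCount8Row{303026454.0, 605887954.0, 67607083.0};
const PerfRow kCount8SIMDRow{44026985.0, 152235981.0, 9969621.0};
```

`ipc(kCount8Row)` comes out at about 2.00 and `ghz(kCount8Row)` at about 4.48, matching the table. Dividing the two cycle counts shows that the hand-written SIMD version needs roughly 6.9 times fewer cycles than the auto-vectorized `count8`.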

<br>
<br>

### `count64` with varying selectivity

|Selectivity (%) | Runtime (s) | Cycles | Instructions | L1-misses | LLC-misses | Branch-misses | Task-clock | IPC | GHz |
|---|---|---|---|---|---|---|---|---|---|
|           1|     0.02| 74598307.00| 151106622.00| 16787016.00|   19264.00|       2075.00| 18561840.00| 2.03| 4.02 |
|          10|     0.02| 74337775.00| 151093901.00| 16786607.00|      20.00|       1995.00| 18481213.00| 2.03| 4.02 |
|          50|     0.02| 77126731.00| 151102084.00| 16786862.00|      10.00|       2004.00| 17790253.00| 1.96| 4.34 |
|          90|     0.02| 76202818.00| 151093920.00| 16786599.00|       5.00|       1931.00| 18193870.00| 1.98| 4.19 |
|          99|     0.02| 75836536.00| 151099305.00| 16786830.00|       1.00|       1929.00| 18017938.00| 1.99| 4.21 |

### `count64BrFree` with varying selectivity

|Selectivity (%) | Runtime (s) | Cycles | Instructions | L1-misses | LLC-misses | Branch-misses | Task-clock | IPC | GHz |
|---|---|---|---|---|---|---|---|---|---|
|           1|     0.02| 75551472.00| 151102283.00| 16787897.00|    1071.00|       1608.00| 17928791.00| 2.00| 4.21 |
|          10|     0.02| 75320401.00| 151091230.00| 16787562.00|       8.00|       1558.00| 17948005.00| 2.01| 4.20 |
|          50|     0.02| 74095232.00| 151099346.00| 16787844.00|       9.00|       1575.00| 18338945.00| 2.04| 4.04 |
|          90|     0.02| 75269268.00| 151093596.00| 16787639.00|      21.00|       1566.00| 18113362.00| 2.01| 4.16 |
|          99|     0.02| 74556385.00| 151093581.00| 16787597.00|       2.00|       1566.00| 18416177.00| 2.03| 4.05 | 

### `count64SIMD` with varying selectivity

|Selectivity (%) | Runtime (s) | Cycles | Instructions | L1-misses | LLC-misses | Branch-misses | Task-clock | IPC | GHz |
|---|---|---|---|---|---|---|---|---|---|
|           1|     0.02| 72763197.00| 167859379.00| 16785832.00|      45.00|       2278.00| 17557114.00| 2.31| 4.14 |
|          10|     0.02| 71748000.00| 167855643.00| 16785664.00|      23.00|       2311.00| 17694600.00| 2.34| 4.05 |
|          50|     0.02| 72700947.00| 167850905.00| 16785493.00|       8.00|       2257.00| 17288096.00| 2.31| 4.21 |
|          90|     0.02| 72470870.00| 167859045.00| 16785841.00|       4.00|       2319.00| 17197390.00| 2.32| 4.21 |
|          99|     0.02| 72723105.00| 167853672.00| 16785593.00|       4.00|       2283.00| 17385915.00| 2.31| 4.18 |

The `int64_t` case has a larger dataset (8x), so more stress is put on the memory bus and more cache misses occur (most visibly in the L1-misses, which rise from roughly 0.3M to roughly 16.8M). The pattern across selectivities remains the same as in the `int8_t` case.
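
One plausible contributing factor for the much smaller SIMD gain with `int64_t` (my reading, not something measured separately here) is that a vector register holds 8x fewer 64-bit elements than 8-bit elements. A minimal sketch of that arithmetic, with a hypothetical helper name:

```cpp
#include <cstddef>
#include <cstdint>

// Number of elements of type T that fit into one vector register
// of the given width in bits.
template <typename T>
constexpr std::size_t lanes_per_register(std::size_t registerBits) {
  return registerBits / (8 * sizeof(T));
}
```

For a 256-bit `ymm` register this gives 32 lanes for `int8_t` but only 4 lanes for `int64_t`, so each SIMD comparison covers far fewer tuples in the 64-bit case.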


---

# Dictionary Decompression

## Introduction

I implemented a precomputed lookup table (PTB) that decodes a whole **byte** instead of a single nibble, which is a standard approach in SIMD-style programming. As the following code shows, the size of the PTB is `sizeof(int64_t) * 256 = 2048 Bytes`, which easily fits into the L1 cache. There may be better PTBs than this trivial one, but I did not go deeper into this part.

```c++
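#include <array>
#include <cstddef>
#include <cstdint>

// Dictionary for a single 4-bit code.
constexpr std::array<int32_t, 16> nibble_dict = {100, 101, 102, 103,
                                                 200, 201, 202, 203,
                                                 300, 301, 302, 303,
                                                 400, 401, 402, 403};

// Maps one input byte (two 4-bit codes) to two decoded values packed into 64 bits.
constexpr std::array<int64_t, 256> fill_byte_dict() {
  std::array<int64_t, 256> byte_dict{0};
  for (size_t i = 0; i < 256; i++) {
    // Non-trivial assumption: little-endian. The value of the high nibble goes into the
    // low 32 bits (first in memory), the value of the low nibble into the high 32 bits.
    byte_dict[i] = (static_cast<int64_t>(nibble_dict[i & 0b1111]) << 32) + nibble_dict[(i >> 4) & 0b1111];
  }
  return byte_dict;
}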

constexpr std::array<int64_t, 256> byte_dict = fill_byte_dict();
```

With this PTB it is not necessary to extract each nibble and decode it individually. I can simply rewrite the given scalar function on top of the PTB: the work inside the loop is greatly reduced, and I expect a much higher IPC from the PTB-based code:

```c++
void dictDecompress_BYTE(uint8_t *in, uint32_t inCount, int32_t *out) {
  int64_t *out_64 = reinterpret_cast<int64_t*>(out);
  for (uint32_t i = 0; i < inCount; i++) {
    out_64[i] = byte_dict[in[i]];  // This line is just a single copy XD
  }
}
```
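
To make the little-endian packing concrete, here is a self-contained sketch that decodes a few bytes both nibble by nibble and via a byte table, and produces the same output. It is a simplified variant, not the code above: the `_sketch` names are hypothetical, the table is built at run time, `std::memcpy` replaces the `reinterpret_cast`, and the output order (high nibble first) is inferred from how the table is constructed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

static const int32_t kNibbleDict[16] = {100, 101, 102, 103, 200, 201, 202, 203,
                                        300, 301, 302, 303, 400, 401, 402, 403};

// Same packing as the byte table above: high nibble in the low 32 bits.
inline std::array<int64_t, 256> make_byte_dict_sketch() {
  std::array<int64_t, 256> table{};
  for (std::size_t i = 0; i < 256; i++) {
    table[i] = (static_cast<int64_t>(kNibbleDict[i & 0xF]) << 32) +
               kNibbleDict[(i >> 4) & 0xF];
  }
  return table;
}

// Reference: decode nibble by nibble, high nibble first.
inline void decode_nibbles_sketch(const uint8_t* in, std::size_t inCount, int32_t* out) {
  assert((in != nullptr && out != nullptr) || inCount == 0);
  for (std::size_t i = 0; i < inCount; i++) {
    out[2 * i] = kNibbleDict[in[i] >> 4];
    out[2 * i + 1] = kNibbleDict[in[i] & 0xF];
  }
}

// Table-based: one lookup and one 8-byte copy per input byte.
// Assumes a little-endian machine, like the original.
inline void decode_bytes_sketch(const uint8_t* in, std::size_t inCount, int32_t* out) {
  assert((in != nullptr && out != nullptr) || inCount == 0);
  static const std::array<int64_t, 256> table = make_byte_dict_sketch();
  for (std::size_t i = 0; i < inCount; i++) {
    std::memcpy(out + 2 * i, &table[in[i]], sizeof(int64_t));
  }
}
```

For example, the input byte `0x1F` has high nibble `1` and low nibble `F`, so both functions write `101` followed by `403`.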

Not surprisingly, this PTB also helps to optimize the other functions to be implemented. Handling a whole byte as one unit is a good way to avoid unnecessary bit operations. The last function, `dictDecompressPermute`, however, has nothing to do with the PTB, since it is required to keep the dictionary in a SIMD register. Under this requirement only the given `nibble_dict` fits into a SIMD register. The PTB can live in the L1 cache, but not in a single tiny register. So the last function uses plain bit operations.
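
The idea of keeping the 16-entry dictionary in a single register can be sketched as follows. This is not the submitted `dictDecompressPermute`, only an illustration under assumptions: the name and signature are hypothetical, the AVX-512 path is compiled only when `__AVX512F__` is defined (for example with `-march=skylake-avx512`), and everything else falls back to scalar code. The 16 `int32_t` entries occupy exactly one 512-bit `zmm` register, and `_mm512_permutexvar_epi32` uses the nibbles as lane indices into it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

static const int32_t kPermuteDict[16] = {100, 101, 102, 103, 200, 201, 202, 203,
                                         300, 301, 302, 303, 400, 401, 402, 403};

// Decodes inCount bytes (two 4-bit codes each, high nibble first)
// into 2 * inCount 32-bit values.
inline void dictDecompressPermute_sketch(const uint8_t* in, std::size_t inCount,
                                         int32_t* out) {
  assert((in != nullptr && out != nullptr) || inCount == 0);
  std::size_t i = 0;
#if defined(__AVX512F__)
  const __m512i dict = _mm512_loadu_si512(kPermuteDict);  // whole dictionary in one zmm
  const __m128i lowMask = _mm_set1_epi8(0x0F);
  for (; i + 8 <= inCount; i += 8) {
    // Load 8 input bytes = 16 codes.
    const __m128i bytes = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(in + i));
    const __m128i hi = _mm_and_si128(_mm_srli_epi16(bytes, 4), lowMask);  // high nibbles
    const __m128i lo = _mm_and_si128(bytes, lowMask);                     // low nibbles
    const __m128i codes = _mm_unpacklo_epi8(hi, lo);  // hi0, lo0, hi1, lo1, ...
    const __m512i idx = _mm512_cvtepu8_epi32(codes);  // widen 16 codes to 32-bit indices
    const __m512i values = _mm512_permutexvar_epi32(idx, dict);  // register lookup
    _mm512_storeu_si512(out + 2 * i, values);
  }
#endif
  // Scalar tail (and complete fallback without AVX-512).
  for (; i < inCount; i++) {
    out[2 * i] = kPermuteDict[in[i] >> 4];
    out[2 * i + 1] = kPermuteDict[in[i] & 0x0F];
  }
}
```

The nibble extraction (`srli`, `and`, `unpacklo`) is exactly the kind of bit work that the byte-wise PTB avoids.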

<br>

The functions to be evaluated:
- dictDecompress
- dictDecompress_BYTE 
- dictDecompress8 
- dictDecompressGather 
- dictDecompressPermute

## Evaluation

- `static constexpr unsigned chunkSize = 6 * 1024;`

|Function | Runtime (s) | Cycles | Instructions | L1-misses | LLC-misses | Branch-misses | Task-clock | IPC | GHz |
|---|---|---|---|---|---|---|---|---|---|
|        dictDecompress|     0.95| 4227264116.00| 12610021897.00|  10992.00|    1235.00|    1234282.00| 945615877.00| 2.98|4.47 |
|   dictDecompress_BYTE|     0.30| 1335041763.00| 5906043128.00|   4448.00|     250.00|    1230785.00| 300948618.00|4.42| 4.44 |
|       dictDecompress8|     0.31| 1369582142.00| 3814647224.00|   6016.00|     159.00|    1231228.00| 308940631.00|2.79| 4.43 |
|  dictDecompressGather|     0.38| 1695602737.00| 3614431670.00|   5015.00|      41.00|    1231441.00| 379709796.00|2.13| 4.47 |
|  dictDecompressPermute|     0.75| 3358729931.00| 7382434310.00|   8557.00|     156.00|    1233481.00| 750297263.00| 2.20|4.48 |

The SIMD versions reduce the runtime a lot compared to the scalar baseline `dictDecompress`. As said before (and as the code shows), `dictDecompressPermute` unpacks the values into nibbles. In addition, the `*_permute_*` instructions need multiple loads to do the work and have a longer latency.

As expected, `dictDecompress_BYTE` comes out on top, with the shortest runtime, the highest IPC, and the fewest L1-misses. The assembly is here: https://compiler.db.in.tum.de/z/nmSJ_Q. As said, we do nothing more than copying data, which is reflected in the assembly consisting **only** of `mov` instructions. This time I do not go deeper into the SIMD versions. They are good, but not optimal for this overly simplified mock-up problem. In a more realistic problem, we would have to combine the PTB with SIMD to avoid unnecessary runtime computation.

# TODO & Question:
1. Is the code with `zmm` better than the code with `ymm`? Or is `ymm` chosen to keep the CPU from getting too hot?
    - The first link below uses `ymm` and the second one uses `zmm`, which surprised me.
    - SIMD code with `-O3 -march=skylake-avx512`: https://compiler.db.in.tum.de/z/dKxLUt
    - SIMD code with `-O0 -march=skylake-avx512`: https://compiler.db.in.tum.de/z/UuA7Gb

2. Why does the compiler not generate `zmm` code, given that the CPU is capable of AVX-512?
    - Scalar code with `-O3 -march=skylake-avx512`: https://compiler.db.in.tum.de/z/TipY86
3. To be honest, I do not understand the `chunkSize` variable. Why do we need it? Does it have an impact on performance, and if so, why?
4. The **LLC-misses** issue in the aggregation part. Either the answer is too obvious for me to see, or it is something I have never heard of. So two extreme cases.

# Appendix

## Experiment Environment

- All experiments were performed on an Intel Core i9-7900X: https://en.wikichip.org/wiki/intel/core_i9/i9-7900x


## Aggregation with `-O1` and `-O2`:

I consider these results less valuable to study, so I do not analyze them and only list them as **raw data and back-up** (if necessary, I can quickly process them with `grep` and `sed` to generate markdown tables like the ones above):

- `-O1`:
          name, selectivity, time sec,       cycles, instructions,  L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
        count8,           1,     0.06, 287754790.00, 939635722.00, 2101705.00,   29105.00,        963.00, 64318743.00,     1, 3.27, 1.00, 4.47 
          name, selectivity, time sec,       cycles, instructions,  L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
  count8BrFree,           1,     0.07, 287788349.00, 939637852.00, 2101547.00,     130.00,        904.00, 65339314.00,     1, 3.27, 1.00, 4.40 
          name, selectivity, time sec,     cycles, instructions,  L1-misses, LLC-misses, branch-misses, task-clock, scale,  IPC, CPUs,  GHz 
    count8SIMD,           1,     0.00, 7275923.00,  18896781.00, 2098081.00,      31.00,        260.00, 1746311.00,     1, 2.60, 1.00, 4.17 
          name, selectivity, time sec,       cycles, instructions,  L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
        count8,          10,     0.07, 287776338.00, 939647803.00, 2101982.00,     128.00,       1079.00, 66631973.00,     1, 3.27, 1.00, 4.32 
          name, selectivity, time sec,       cycles, instructions,  L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
  count8BrFree,          10,     0.07, 287800581.00, 939655430.00, 2101790.00,     172.00,        913.00, 65015405.00,     1, 3.26, 1.00, 4.43 
          name, selectivity, time sec,     cycles, instructions,  L1-misses, LLC-misses, branch-misses, task-clock, scale,  IPC, CPUs,  GHz 
    count8SIMD,          10,     0.00, 7170435.00,  18896788.00, 2098160.00,      25.00,        358.00, 1723214.00,     1, 2.64, 1.00, 4.16 
          name, selectivity, time sec,       cycles, instructions,  L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
        count8,          50,     0.07, 287854231.00, 939647118.00, 2101995.00,     175.00,        909.00, 67587460.00,     1, 3.26, 1.00, 4.26 
          name, selectivity, time sec,       cycles, instructions,  L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
  count8BrFree,          50,     0.07, 287958053.00, 939643469.00, 2101858.00,      81.00,        872.00, 66568401.00,     1, 3.26, 1.00, 4.33 
          name, selectivity, time sec,     cycles, instructions,  L1-misses, LLC-misses, branch-misses, task-clock, scale,  IPC, CPUs,  GHz 
    count8SIMD,          50,     0.00, 7190618.00,  18896764.00, 2098075.00,       8.00,        274.00, 1726542.00,     1, 2.63, 1.00, 4.16 
          name, selectivity, time sec,       cycles, instructions,  L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
        count8,          90,     0.07, 287768579.00, 939639849.00, 2101758.00,      32.00,        866.00, 66975877.00,     1, 3.27, 1.00, 4.30 
          name, selectivity, time sec,       cycles, instructions,  L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
  count8BrFree,          90,     0.06, 287550772.00, 939635535.00, 2101408.00,      25.00,        779.00, 64607587.00,     1, 3.27, 1.00, 4.45 
          name, selectivity, time sec,     cycles, instructions,  L1-misses, LLC-misses, branch-misses, task-clock, scale,  IPC, CPUs,  GHz 
    count8SIMD,          90,     0.00, 7308862.00,  18904281.00, 2098406.00,      24.00,        302.00, 1825409.00,     1, 2.59, 1.00, 4.00 
          name, selectivity, time sec,       cycles, instructions,  L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
        count8,          99,     0.07, 287739603.00, 939648638.00, 2102023.00,      64.00,       1011.00, 66763662.00,     1, 3.27, 1.00, 4.31 
          name, selectivity, time sec,       cycles, instructions,  L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
  count8BrFree,          99,     0.06, 287782839.00, 939637066.00, 2101378.00,      24.00,        828.00, 64074377.00,     1, 3.27, 1.00, 4.49 
          name, selectivity, time sec,     cycles, instructions,  L1-misses, LLC-misses, branch-misses, task-clock, scale,  IPC, CPUs,  GHz 
    count8SIMD,          99,     0.00, 7120195.00,  18904233.00, 2098407.00,      14.00,        307.00, 1779185.00,     1, 2.66, 1.00, 4.00 
           name, selectivity, time sec,       cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
        count64,           1,     0.07, 291492978.00, 939676261.00, 16787623.00,   19164.00,       1657.00, 66813087.00,     1, 3.22, 1.00, 4.36 
           name, selectivity, time sec,       cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
  count64BrFree,           1,     0.07, 291257921.00, 939677404.00, 16788597.00,    1385.00,       2286.00, 66395047.00,     1, 3.23, 1.00, 4.39 
           name, selectivity, time sec,      cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
    count64SIMD,           1,     0.02, 65977353.00, 167856185.00, 16785407.00,      27.00,       2312.00, 15674289.00,     1, 2.54, 1.00, 4.21 
           name, selectivity, time sec,       cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
        count64,          10,     0.07, 291207280.00, 939672614.00, 16787532.00,      16.00,       1618.00, 65481997.00,     1, 3.23, 1.00, 4.45 
           name, selectivity, time sec,       cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
  count64BrFree,          10,     0.07, 291096892.00, 939675160.00, 16788553.00,      12.00,       2125.00, 67539328.00,     1, 3.23, 1.00, 4.31 
           name, selectivity, time sec,      cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
    count64SIMD,          10,     0.02, 67047079.00, 167856207.00, 16785389.00,       7.00,       2285.00, 15028130.00,     1, 2.50, 1.00, 4.46 
           name, selectivity, time sec,       cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
        count64,          50,     0.07, 291071616.00, 939672641.00, 16787555.00,      23.00,       1599.00, 67395253.00,     1, 3.23, 1.00, 4.32 
           name, selectivity, time sec,       cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
  count64BrFree,          50,     0.06, 291313468.00, 939679161.00, 16788604.00,      34.00,       2445.00, 64860159.00,     1, 3.23, 1.00, 4.49 
           name, selectivity, time sec,      cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
    count64SIMD,          50,     0.02, 66057946.00, 167857828.00, 16785466.00,       7.00,       2362.00, 16155568.00,     1, 2.54, 1.00, 4.09 
           name, selectivity, time sec,       cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
        count64,          90,     0.06, 291162005.00, 939676496.00, 16787697.00,      20.00,       1668.00, 64833842.00,     1, 3.23, 1.00, 4.49 
           name, selectivity, time sec,       cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
  count64BrFree,          90,     0.07, 291741105.00, 939795743.00, 16793996.00,    1445.00,       3812.00, 66927604.00,     1, 3.22, 1.00, 4.36 
           name, selectivity, time sec,      cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
    count64SIMD,          90,     0.02, 67541670.00, 168106513.00, 16795301.00,    1115.00,       3469.00, 15658681.00,     1, 2.49, 1.00, 4.31 
           name, selectivity, time sec,       cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
        count64,          99,     0.07, 291419705.00, 939722791.00, 16789983.00,     320.00,       2228.00, 67215480.00,     1, 3.22, 1.00, 4.34 
           name, selectivity, time sec,       cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
  count64BrFree,          99,     0.07, 291210544.00, 939674509.00, 16788492.00,      19.00,       2639.00, 65623097.00,     1, 3.23, 1.00, 4.44 
           name, selectivity, time sec,      cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
    count64SIMD,          99,     0.02, 66126737.00, 167856162.00, 16785401.00,      15.00,       2290.00, 16171263.00,     1, 2.54, 1.00, 4.09 
    
 - `-O2`

          name, selectivity, time sec,       cycles, instructions,  L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
        count8,           1,     0.06, 257698663.00, 939628827.00, 2101514.00,   29362.00,        878.00, 57267736.00,     1, 3.65, 1.00, 4.50 
          name, selectivity, time sec,       cycles, instructions,  L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
  count8BrFree,           1,     0.06, 257619479.00, 939631538.00, 2101497.00,     149.00,        876.00, 59978447.00,     1, 3.65, 1.00, 4.30 
          name, selectivity, time sec,      cycles, instructions,  L1-misses, LLC-misses, branch-misses, task-clock, scale,  IPC, CPUs,  GHz 
    count8SIMD,           1,     0.00, 11235407.00,  18896506.00, 2098233.00,      31.00,        373.00, 2625342.00,     1, 1.68, 1.00, 4.28 
          name, selectivity, time sec,       cycles, instructions,  L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
        count8,          10,     0.06, 257635474.00, 939637009.00, 2101738.00,      69.00,        845.00, 60311466.00,     1, 3.65, 1.00, 4.27 
          name, selectivity, time sec,       cycles, instructions,  L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
  count8BrFree,          10,     0.06, 257479421.00, 939630322.00, 2101369.00,      52.00,        785.00, 58535684.00,     1, 3.65, 1.00, 4.40 
          name, selectivity, time sec,      cycles, instructions,  L1-misses, LLC-misses, branch-misses, task-clock, scale,  IPC, CPUs,  GHz 
    count8SIMD,          10,     0.00, 11301605.00,  18903991.00, 2098555.00,      43.00,        542.00, 2660121.00,     1, 1.67, 1.00, 4.25 
          name, selectivity, time sec,       cycles, instructions,  L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
        count8,          50,     0.06, 257589881.00, 939634320.00, 2101539.00,      51.00,        874.00, 60225943.00,     1, 3.65, 1.00, 4.28 
          name, selectivity, time sec,       cycles, instructions,  L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
  count8BrFree,          50,     0.06, 257629142.00, 939632597.00, 2101384.00,      38.00,        797.00, 57373909.00,     1, 3.65, 1.00, 4.49 
          name, selectivity, time sec,      cycles, instructions,  L1-misses, LLC-misses, branch-misses, task-clock, scale,  IPC, CPUs,  GHz 
    count8SIMD,          50,     0.00, 10925578.00,  18904045.00, 2098542.00,      17.00,        405.00, 2707304.00,     1, 1.73, 1.00, 4.04 
          name, selectivity, time sec,       cycles, instructions,  L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
        count8,          90,     0.06, 257659199.00, 939633177.00, 2101532.00,      24.00,        829.00, 58563195.00,     1, 3.65, 1.00, 4.40 
          name, selectivity, time sec,       cycles, instructions,  L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
  count8BrFree,          90,     0.06, 257458904.00, 939626627.00, 2101204.00,      16.00,        751.00, 58397800.00,     1, 3.65, 1.00, 4.41 
          name, selectivity, time sec,      cycles, instructions,  L1-misses, LLC-misses, branch-misses, task-clock, scale,  IPC, CPUs,  GHz 
    count8SIMD,          90,     0.00, 10766972.00,  18896506.00, 2098255.00,      15.00,        358.00, 2668497.00,     1, 1.76, 1.00, 4.03 
          name, selectivity, time sec,       cycles, instructions,  L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
        count8,          99,     0.06, 257608222.00, 939626854.00, 2101261.00,      36.00,        728.00, 58701274.00,     1, 3.65, 1.00, 4.39 
          name, selectivity, time sec,       cycles, instructions,  L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
  count8BrFree,          99,     0.06, 257316207.00, 939628965.00, 2101279.00,      66.00,        882.00, 58865193.00,     1, 3.65, 1.00, 4.37 
          name, selectivity, time sec,      cycles, instructions,  L1-misses, LLC-misses, branch-misses, task-clock, scale,  IPC, CPUs,  GHz 
    count8SIMD,          99,     0.00, 11226295.00,  18896489.00, 2098234.00,       7.00,        360.00, 2623890.00,     1, 1.68, 1.00, 4.28 
           name, selectivity, time sec,       cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
        count64,           1,     0.06, 274453210.00, 939886928.00, 16798913.00,   21284.00,       4585.00, 64520744.00,     1, 3.42, 1.00, 4.25 
           name, selectivity, time sec,       cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
  count64BrFree,           1,     0.06, 273671301.00, 939666792.00, 16789239.00,     549.00,       2587.00, 60938102.00,     1, 3.43, 1.00, 4.49 
           name, selectivity, time sec,      cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
    count64SIMD,           1,     0.02, 78031836.00, 151078354.00, 16785328.00,      48.00,       1285.00, 18431284.00,     1, 1.94, 1.00, 4.23 
           name, selectivity, time sec,       cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
        count64,          10,     0.06, 273582569.00, 939661699.00, 16788129.00,      87.00,       2590.00, 61939133.00,     1, 3.43, 1.00, 4.42 
           name, selectivity, time sec,       cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
  count64BrFree,          10,     0.06, 273395829.00, 939659073.00, 16789066.00,      67.00,       2528.00, 63699046.00,     1, 3.44, 1.00, 4.29 
           name, selectivity, time sec,      cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
    count64SIMD,          10,     0.02, 77870877.00, 151075372.00, 16785301.00,      20.00,       1257.00, 18375350.00,     1, 1.94, 1.00, 4.24 
           name, selectivity, time sec,       cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
        count64,          50,     0.07, 273835495.00, 939768240.00, 16793215.00,     414.00,       3419.00, 65133328.00,     1, 3.43, 1.00, 4.20 
           name, selectivity, time sec,       cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
  count64BrFree,          50,     0.06, 273657706.00, 939670332.00, 16789396.00,      54.00,       2619.00, 60936018.00,     1, 3.43, 1.00, 4.49 
           name, selectivity, time sec,      cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
    count64SIMD,          50,     0.02, 76236858.00, 151083471.00, 16785585.00,      13.00,       1304.00, 18636876.00,     1, 1.98, 1.00, 4.09 
           name, selectivity, time sec,       cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
        count64,          90,     0.06, 273597355.00, 939665698.00, 16788226.00,      77.00,       2611.00, 62100741.00,     1, 3.43, 1.00, 4.41 
           name, selectivity, time sec,       cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
  count64BrFree,          90,     0.06, 273525210.00, 939671119.00, 16789468.00,      40.00,       2672.00, 63595816.00,     1, 3.44, 1.00, 4.30 
           name, selectivity, time sec,      cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
    count64SIMD,          90,     0.02, 78198494.00, 151082048.00, 16785548.00,      53.00,       1309.00, 18416917.00,     1, 1.93, 1.00, 4.25 
           name, selectivity, time sec,       cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
        count64,          99,     0.06, 273549464.00, 939659512.00, 16788050.00,      40.00,       2547.00, 62205638.00,     1, 3.44, 1.00, 4.40 
           name, selectivity, time sec,       cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
  count64BrFree,          99,     0.06, 273578885.00, 939659786.00, 16789064.00,      55.00,       2569.00, 62368798.00,     1, 3.43, 1.00, 4.39 
           name, selectivity, time sec,      cycles, instructions,   L1-misses, LLC-misses, branch-misses,  task-clock, scale,  IPC, CPUs,  GHz 
    count64SIMD,          99,     0.02, 77774420.00, 151074873.00, 16785304.00,      16.00,       1258.00, 18542856.00,     1, 1.94, 1.00, 4.19 
