### A3.4.2. Hardware Performance Counters

$$
\text{CPI} = \frac{\text{Cycles}}{\text{Instructions}} = \text{CPI}_{\text{base}} + \sum_{i} \text{miss\_rate}_i \times \text{penalty}_i
$$

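where each $i$ indexes a stall source (L1 miss, L2 miss, branch miss, TLB miss), $\text{miss\_rate}_i$ is the number of such events per instruction, and $\text{penalty}_i$ is the average number of cycles lost per event.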
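As a quick worked example of the formula, the sketch below plugs in made-up numbers. The base CPI, the per-instruction event rates, and the penalties are all hypothetical values chosen for illustration, not measurements from any particular CPU.

```python
def estimate_cpi(base_cpi, stall_sources):
    """Return base CPI plus the sum of (events per instruction * penalty cycles)."""
    return base_cpi + sum(rate * penalty for rate, penalty in stall_sources.values())


# Hypothetical inputs: (events per instruction, penalty in cycles per event).
stall_sources = {
    "L1 miss": (0.02, 12),
    "L2 miss": (0.005, 40),
    "branch miss": (0.01, 18),
    "TLB miss": (0.001, 30),
}

cpi = estimate_cpi(0.5, stall_sources)
print(f"Estimated CPI: {cpi:.2f}")
```

With these inputs the stall terms add 0.65 cycles per instruction on top of the base of 0.5, so more than half of the estimated CPI comes from stalls.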

**Explanation:**

**Hardware Performance Counters (HPCs)** are special-purpose registers built into the CPU that count micro-architectural events without software instrumentation. They enable precise, low-overhead measurement of what the hardware is actually doing.

**Counter Categories:**

| Category | Example Events |
|----------|---------------|
| Execution | `instructions`, `cycles`, `branches`, `branch-misses` |
| Memory | `cache-references`, `cache-misses`, `L1-dcache-load-misses`, `LLC-load-misses` |
| TLB | `dTLB-load-misses`, `iTLB-load-misses` |
| Frontend | `stalled-cycles-frontend`, `L1-icache-load-misses` |
| Backend | `stalled-cycles-backend`, `resource_stalls.any` (Intel-specific) |

**Counter Modes:**

- **Counting**: accumulate total events for the run (`perf stat`).
- **Sampling**: interrupt every N events and record the instruction pointer, building a histogram of hot spots (`perf record`).
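The difference between the two modes can be sketched with a toy event stream. The addresses and counts below are invented, and the stream is grouped by address for simplicity; a real run interleaves events from many code locations.

```python
from collections import Counter

# Hypothetical event stream: the instruction address at which each cache miss occurred.
miss_addresses = [0x400A10] * 700 + [0x400B20] * 250 + [0x400C30] * 50

# Counting mode: a single total for the whole run, with no location information.
total_misses = len(miss_addresses)

# Sampling mode: record the address of every N-th event only.
SAMPLE_PERIOD = 10
samples = Counter(miss_addresses[SAMPLE_PERIOD - 1::SAMPLE_PERIOD])

print(f"total misses: {total_misses}")
for address, hits in samples.most_common():
    print(f"  {address:#x}: {hits} samples (~{hits * SAMPLE_PERIOD} misses)")
```

Counting gives the exact total but says nothing about where the misses happen. Sampling records only one event in ten here, yet the histogram still shows which address dominates.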

**Top-Down Microarchitecture Analysis (TMA):**

Intel's methodology classifies pipeline slots into:

1. **Retiring**: slots that produced useful work (good).
2. **Bad Speculation**: slots wasted on mispredicted branches.
3. **Frontend Bound**: slots lost because the frontend couldn't deliver instructions.
4. **Backend Bound**: slots lost because execution units or memory were busy.
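The four categories can be computed from a handful of counters. The sketch below is patterned on the level-1 breakdown described in Yasin (2014), but the parameter names are hypothetical placeholders rather than real event names, and the 4-wide pipeline and all counts are assumptions made for illustration.

```python
def top_down_level1(cycles, slots_per_cycle, retired_slots, issued_slots,
                    recovery_cycles, frontend_undelivered_slots):
    """Split total pipeline slots into the four top-level categories (as fractions)."""
    total_slots = cycles * slots_per_cycle
    retiring = retired_slots / total_slots
    # Slots issued but never retired, plus slots lost while recovering from a mispredict.
    bad_speculation = (issued_slots - retired_slots
                       + slots_per_cycle * recovery_cycles) / total_slots
    frontend_bound = frontend_undelivered_slots / total_slots
    # Whatever remains is attributed to the backend.
    backend_bound = 1.0 - (retiring + bad_speculation + frontend_bound)
    return {
        "Retiring": retiring,
        "Bad Speculation": bad_speculation,
        "Frontend Bound": frontend_bound,
        "Backend Bound": backend_bound,
    }


# Hypothetical counter values for a 4-wide core.
breakdown = top_down_level1(
    cycles=1_000_000, slots_per_cycle=4,
    retired_slots=1_600_000, issued_slots=1_800_000,
    recovery_cycles=50_000, frontend_undelivered_slots=800_000,
)
for name, fraction in breakdown.items():
    print(f"{name:>16}: {fraction:.1%}")
```

Because Backend Bound is derived as the remainder, the four fractions sum to 1 by construction.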

**Multiplexing:**

CPUs have a limited number of counter registers (typically 4 to 8). When more events are requested than there are counters, `perf` time-multiplexes them and scales each count by the fraction of time the event was actually scheduled, which introduces estimation error.
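The scaling step amounts to extrapolating from the time an event was actually on a counter to the time it was enabled. The function below is a minimal sketch of that arithmetic; its name and the sample numbers are hypothetical.

```python
def scale_multiplexed_count(raw_count, time_enabled_ns, time_running_ns):
    """Extrapolate a raw count to the full enabled time.

    Returns None when the event was never scheduled, since no estimate is possible.
    """
    if time_running_ns == 0:
        return None
    return raw_count * time_enabled_ns / time_running_ns


# Hypothetical reading: the event was on a counter for 25% of the run.
raw = 1_200_000
enabled_ns = 1_000_000_000
running_ns = 250_000_000

estimate = scale_multiplexed_count(raw, enabled_ns, running_ns)
coverage = running_ns / enabled_ns
print(f"raw={raw:,}  coverage={coverage:.0%}  scaled estimate={estimate:,.0f}")
```

The estimate assumes the event rate was the same while the counter was not scheduled. The lower the coverage and the burstier the workload, the less reliable that assumption becomes.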

**Example:**

```bash
$ perf stat -e L1-dcache-load-misses,LLC-load-misses,dTLB-load-misses ./program
    15,000,000  L1-dcache-load-misses
     2,000,000  LLC-load-misses       # going to DRAM
       500,000  dTLB-load-misses
```
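To work with these numbers programmatically, the text can be parsed into a dictionary. The parser below is a rough sketch that only handles lines shaped like the ones shown above; real `perf stat` output contains additional lines and its layout varies between versions.

```python
import re

# A count with thousands separators, followed by the event name.
PERF_LINE = re.compile(r"^\s*([\d,]+)\s+(\S+)")


def parse_perf_stat(text):
    """Map event name to integer count for each line that looks like a counter line."""
    counts = {}
    for line in text.splitlines():
        match = PERF_LINE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1).replace(",", ""))
    return counts


sample_output = """\
    15,000,000  L1-dcache-load-misses
     2,000,000  LLC-load-misses       # going to DRAM
       500,000  dTLB-load-misses
"""

counts = parse_perf_stat(sample_output)
llc_share = counts["LLC-load-misses"] / counts["L1-dcache-load-misses"]
print(f"{llc_share:.1%} of L1D load misses also miss the LLC")
```

In this example roughly one in eight L1D load misses goes all the way to DRAM, which is the kind of ratio worth checking before deciding which cache level to optimize for.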

```python
from dataclasses import dataclass


@dataclass
class CounterSnapshot:
    """Raw counter totals collected over one run of a workload."""
    label: str
    cycles: int
    instructions: int
    l1d_misses: int
    l1d_accesses: int
    llc_misses: int
    llc_accesses: int
    dtlb_misses: int
    dtlb_accesses: int
    branch_misses: int
    branches: int


def analyze_counters(snapshot: CounterSnapshot) -> None:
    """Print IPC, miss rates, and a rough stall-cycle breakdown for one snapshot."""
    ipc = snapshot.instructions / snapshot.cycles
    cpi = snapshot.cycles / snapshot.instructions

    l1d_miss_rate = snapshot.l1d_misses / snapshot.l1d_accesses
    llc_miss_rate = snapshot.llc_misses / snapshot.llc_accesses
    dtlb_miss_rate = snapshot.dtlb_misses / snapshot.dtlb_accesses
    branch_miss_rate = snapshot.branch_misses / snapshot.branches

    # Illustrative per-event penalties in cycles; real values depend on the
    # microarchitecture and memory system.
    l1_penalty_cycles = 12
    llc_penalty_cycles = 200
    tlb_penalty_cycles = 30
    branch_penalty_cycles = 18

    # Naive model: every event stalls the pipeline for its full penalty.
    # Out-of-order execution overlaps many of these stalls, so the total is a
    # rough estimate that can exceed the measured cycle count.
    stall_contributions = {
        "L1D miss": snapshot.l1d_misses * l1_penalty_cycles,
        "LLC miss": snapshot.llc_misses * llc_penalty_cycles,
        "dTLB miss": snapshot.dtlb_misses * tlb_penalty_cycles,
        "Branch miss": snapshot.branch_misses * branch_penalty_cycles,
    }

    print(f"\n{snapshot.label}:")
    print(f"  IPC: {ipc:.2f}  (CPI: {cpi:.2f})")
    print(f"  L1D miss rate: {l1d_miss_rate:.2%}")
    print(f"  LLC miss rate: {llc_miss_rate:.2%}")
    print(f"  dTLB miss rate: {dtlb_miss_rate:.4%}")
    print(f"  Branch miss rate: {branch_miss_rate:.2%}")

    total_stall_cycles = sum(stall_contributions.values())
    print(f"  Estimated stall cycles: {total_stall_cycles:,} ({total_stall_cycles/snapshot.cycles:.1%} of total)")

    # Largest contributor first.
    sorted_stalls = sorted(stall_contributions.items(), key=lambda item: item[1], reverse=True)
    print("  Stall breakdown:")
    for source, stall_cycles in sorted_stalls:
        # Guard against a snapshot with no stall events at all.
        fraction = stall_cycles / total_stall_cycles if total_stall_cycles else 0.0
        print(f"    {source:>14}: {stall_cycles:>14,} cycles ({fraction:.0%})")


good_workload = CounterSnapshot(
    label="Optimized (tiled, sorted)",
    cycles=3_200_000_000, instructions=8_500_000_000,
    l1d_misses=5_000_000, l1d_accesses=2_000_000_000,
    llc_misses=200_000, llc_accesses=5_000_000,
    dtlb_misses=10_000, dtlb_accesses=2_000_000_000,
    branch_misses=100_000, branches=1_500_000_000,
)

bad_workload = CounterSnapshot(
    label="Unoptimized (naive, random)",
    cycles=12_800_000_000, instructions=6_000_000_000,
    l1d_misses=400_000_000, l1d_accesses=2_000_000_000,
    llc_misses=50_000_000, llc_accesses=400_000_000,
    dtlb_misses=5_000_000, dtlb_accesses=2_000_000_000,
    branch_misses=75_000_000, branches=1_500_000_000,
)

analyze_counters(good_workload)
analyze_counters(bad_workload)
```

**References:**

[📘 Gregg, B. (2020). *Systems Performance: Enterprise and the Cloud (2nd ed.).* Addison-Wesley.](https://www.brendangregg.com/systems-performance-2nd-edition-book.html)

[📘 Yasin, A. (2014). *A Top-Down Method for Performance Analysis and Counters Architecture.* IEEE ISPASS.](https://ieeexplore.ieee.org/document/6844459)

---

[⬅️ Previous: Linux perf Tool](./01_linux_perf_tool.ipynb)