### A3.1.1. Cache Hierarchy

$$
T_{\text{effective}} = h \cdot T_{\text{cache}} + (1 - h) \cdot T_{\text{memory}}
$$

where $h$ is the hit rate, $T_{\text{cache}}$ the cache access latency, and $T_{\text{memory}}$ the main memory latency.
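To get a feel for how sensitive the formula is to the hit rate, the following minimal sketch evaluates it for a few values of $h$. The function name `effective_latency` and the latencies (4 and 200 cycles) are illustrative assumptions, not measurements of any particular CPU.

```python
def effective_latency(hit_rate, cache_cycles, memory_cycles):
    """Average access time for a single cache level, in cycles."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_cycles + (1.0 - hit_rate) * memory_cycles


# Assumed latencies for illustration: 4-cycle cache, 200-cycle main memory.
for h in (0.80, 0.90, 0.95, 0.99):
    print(f"h = {h:.2f} -> {effective_latency(h, 4, 200):6.2f} cycles")
```

With these assumed numbers the miss term dominates: even at a 95% hit rate, most of the average latency comes from the 5% of accesses that go to memory.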

**Explanation:**

Modern CPUs bridge the processor–memory speed gap with a **cache hierarchy**: small, fast SRAM buffers organized in levels (L1, L2, L3) between the CPU core and main memory (DRAM).

| Level | Typical Size | Typical Latency | Scope |
|-------|-------------|-----------------|-------|
| L1d   | 32–64 KB    | ~4 cycles       | Per core |
| L1i   | 32–64 KB    | ~4 cycles       | Per core (instructions) |
| L2    | 256 KB–1 MB | ~12 cycles      | Per core |
| L3    | 4–64 MB     | ~40 cycles      | Shared across cores |
| DRAM  | GBs         | ~200 cycles     | System-wide |
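
The single-level formula can be chained across the levels in the table. The sketch below is one way to do that, assuming each latency is the total load-to-use time for a hit at that level and that each hit rate is *local* (the fraction of accesses reaching that level which hit there). The helper `hierarchy_latency` and the hit rates 0.95, 0.80, and 0.50 are hypothetical values chosen for illustration.

```python
def hierarchy_latency(levels, memory_cycles):
    """Expected access latency in cycles for a multi-level hierarchy.

    levels: list of (local_hit_rate, latency_cycles), ordered from L1 outward.
    """
    expected = 0.0
    reach_probability = 1.0  # probability that an access gets as far as this level
    for hit_rate, latency in levels:
        expected += reach_probability * hit_rate * latency
        reach_probability *= 1.0 - hit_rate
    # Whatever misses in every level is served by main memory.
    return expected + reach_probability * memory_cycles


# Latencies from the table above; hit rates are assumptions.
levels = [(0.95, 4), (0.80, 12), (0.50, 40)]
print(f"Expected latency: {hierarchy_latency(levels, 200):.2f} cycles")
```

With a single level, this reduces to the formula at the top of the section.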

**Key Concepts:**

- **Cache line**: the unit of transfer between levels, typically 64 bytes. Accessing one byte loads the entire line.
- **Spatial locality**: accessing data near recently accessed data (sequential array traversal).
- **Temporal locality**: accessing the same data again soon (loop variables).
- **Associativity**: the number of cache locations (ways) in which a given address may be placed (one for direct-mapped, N for N-way set-associative, any location for fully associative).
- **Eviction policy**: which line to replace when a miss must bring a new line into a full set (typically LRU or pseudo-LRU).
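
To make the cache line and associativity concepts concrete, here is a minimal sketch of how an address can be split into a tag, a set index, and a byte offset. It assumes a 32 KB, 8-way cache with 64-byte lines (hence 64 sets) and simple modulo indexing. Real hardware may index with physical address bits or hash them, so treat `split_address` as a hypothetical model only.

```python
def split_address(address, line_bytes=64, num_sets=64):
    """Split a byte address into (tag, set_index, offset) using modulo indexing."""
    offset = address % line_bytes        # position of the byte within its line
    line_number = address // line_bytes  # which 64-byte line the byte belongs to
    set_index = line_number % num_sets   # the set this line must be placed in
    tag = line_number // num_sets        # identifies the line within that set
    return tag, set_index, offset


# Assumed geometry: 32 KB / (64 B per line * 8 ways) = 64 sets.
for address in (0x1200, 0x1234, 0x123F, 0x2234):
    tag, set_index, offset = split_address(address)
    print(f"{address:#06x}: tag={tag}, set={set_index}, offset={offset}")
```

Addresses `0x1200` through `0x123F` share a tag and set index, meaning they live in the same line. `0x2234` maps to the same set with a different tag, so it would occupy another way of that set.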

**Common Cache Pathologies:**

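- **Capacity miss**: the working set exceeds the cache size, so lines are evicted before they are reused.
- **Conflict miss**: multiple addresses compete for the same set in a set-associative cache, even though other sets still have room.
- **False sharing**: two cores write to different variables on the same cache line, so each write invalidates the other core's copy and generates coherence traffic.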
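
The first two pathologies can be reproduced with a toy model. The sketch below replays address streams through a small set-associative cache with true LRU replacement and counts misses. The geometry (64 sets, 8 ways, 64-byte lines, so 32 KB) and the function `count_misses` are assumptions for illustration. The sketch does not model false sharing, prefetching, or multiple levels.

```python
from collections import OrderedDict

LINE_BYTES = 64


def count_misses(addresses, num_sets=64, ways=8, passes=2):
    """Replay byte addresses through a toy set-associative LRU cache; return misses."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for _ in range(passes):
        for address in addresses:
            line = address // LINE_BYTES
            lines_in_set = sets[line % num_sets]
            if line in lines_in_set:
                lines_in_set.move_to_end(line)  # hit: mark as most recently used
                continue
            if len(lines_in_set) >= ways:
                lines_in_set.popitem(last=False)  # evict the least recently used line
            lines_in_set[line] = True
            misses += 1
    return misses


fits = [i * LINE_BYTES for i in range(256)]          # 16 KB: fits in the 32 KB cache
too_big = [i * LINE_BYTES for i in range(1024)]      # 64 KB: twice the cache size
same_set = [i * LINE_BYTES * 64 for i in range(16)]  # 1 KB of lines, all in set 0

print("fits in cache:  ", count_misses(fits))      # only first-touch misses
print("capacity misses:", count_misses(too_big))   # every access misses
print("conflict misses:", count_misses(same_set))  # every access misses
```

In this model the `same_set` stream touches only 16 lines, far fewer than the 512 the cache can hold. It still misses on every access because a 4096-byte stride sends all of them to one 8-way set.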

**Example:**

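Row-major traversal of a 2D array stored in row-major order accesses contiguous memory (spatial locality), so each 64-byte line serves several consecutive elements and the hit rate is high. Column-major traversal strides by the row length; once the lines touched by one column no longer fit in cache, nearly every access misses. The timing below iterates element by element in Python, where interpreter overhead dominates, so the measured slowdown is usually much smaller than the difference in miss rates would suggest.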

In [None]:
import numpy as np
import time

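MATRIX_SIZE = 4096
# np.random.rand returns a C-ordered (row-major) float64 array: each row is contiguous.
matrix = np.random.rand(MATRIX_SIZE, MATRIX_SIZE)

# Row-major traversal: the inner loop moves along a row, one 8-byte element at a time.
start = time.perf_counter()
row_sum = 0.0
for row in range(MATRIX_SIZE):
    for col in range(MATRIX_SIZE):
        row_sum += matrix[row, col]
row_major_time = time.perf_counter() - start

# Column-major traversal: the inner loop moves down a column, jumping one full row
# (MATRIX_SIZE * 8 bytes) between consecutive accesses.
start = time.perf_counter()
col_sum = 0.0
for col in range(MATRIX_SIZE):
    for row in range(MATRIX_SIZE):
        col_sum += matrix[row, col]
col_major_time = time.perf_counter() - start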

cache_line_bytes = 64
element_bytes = matrix.itemsize
elements_per_line = cache_line_bytes // element_bytes

print(f"Matrix: {MATRIX_SIZE}x{MATRIX_SIZE}, element size: {element_bytes} bytes")
print(f"Cache line: {cache_line_bytes} bytes = {elements_per_line} elements")
print(f"Row-major time: {row_major_time:.3f}s")
print(f"Col-major time: {col_major_time:.3f}s")
print(f"Slowdown factor: {col_major_time / row_major_time:.1f}x")

# Plug illustrative numbers into T_effective = h * T_cache + (1 - h) * T_memory.
hit_rate = 0.95
cache_latency_cycles = 4
memory_latency_cycles = 200
effective_latency = hit_rate * cache_latency_cycles + (1 - hit_rate) * memory_latency_cycles
print(f"\nEffective latency at {hit_rate:.0%} hit rate: {effective_latency:.1f} cycles")
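
Because wall-clock timings in Python are noisy, it can help to look at the hit rates directly. The following scaled-down sketch replays the two traversal orders through a fully associative LRU cache model. The sizes (a 256x256 matrix of 8-byte elements, a 128-line cache) and the name `simulated_hit_rate` are assumptions chosen so that one column touches more lines than the cache holds. It is a toy model with no prefetching, not a description of real hardware.

```python
from collections import OrderedDict

N = 256            # matrix is N x N (scaled down from the timing example)
ELEMENT_BYTES = 8  # float64


def simulated_hit_rate(addresses, cache_lines=128, line_bytes=64):
    """Replay byte addresses through a fully associative LRU cache model."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for address in addresses:
        line = address // line_bytes
        total += 1
        if line in cache:
            cache.move_to_end(line)  # hit: mark as most recently used
            hits += 1
        else:
            if len(cache) >= cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = True
    return hits / total


def element_address(row, col):
    """Byte offset of matrix[row, col] in a row-major layout starting at address 0."""
    return (row * N + col) * ELEMENT_BYTES


row_major = (element_address(r, c) for r in range(N) for c in range(N))
col_major = (element_address(r, c) for c in range(N) for r in range(N))

row_rate = simulated_hit_rate(row_major)
col_rate = simulated_hit_rate(col_major)
print(f"Row-major hit rate:    {row_rate:.3f}")
print(f"Column-major hit rate: {col_rate:.3f}")
```

In this model the row-major order misses once per 8-element line, while the column-major order evicts every line before it is revisited. Enlarging `cache_lines` to at least `N` lets the column-major order reuse lines across neighbouring columns.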

**References:**

[📘 Hennessy, J. & Patterson, D. (2019). *Computer Architecture: A Quantitative Approach (6th ed.).* Morgan Kaufmann.](https://www.elsevier.com/books/computer-architecture/hennessy/978-0-12-811905-1)

[📘 Drepper, U. (2007). *What Every Programmer Should Know About Memory.* Red Hat.](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf)

---

[Next: Translation Lookaside Buffer ➡️](./02_translation_lookaside_buffer.ipynb)