A minimal, controlled set of microbenchmarks that make CPU cache-line sharing visibly hurt.
This repository exists to answer a deceptively simple question:
Why does false sharing sometimes look harmless — and sometimes catastrophic?
The short answer is: it depends on whether coherence latency is on the critical path.
This project demonstrates that fact with four tiny programs, arranged along two orthogonal axes, and nothing else.
This is:
- A controlled experiment
- A teaching benchmark
- A coherence-mechanics demo
- Reproducible on commodity hardware
This is not:
- A general performance benchmark
- A realistic workload
- A cache simulator
- A library or framework
Every line exists to isolate a single effect.
The benchmarks are organized as a 2×2 matrix.
| Case | Description |
|---|---|
| `false_sharing` | Two threads update different variables that reside on the same cache line |
| `padded_no_sharing` | Each thread updates a variable on its own cache line |

This isolates false sharing vs. no sharing.
| Case | Description |
|---|---|
| `store-only` | Each iteration performs a single store (latency may be hidden) |
| `read-modify-write` | Each iteration performs an explicit load → modify → store |

This isolates whether coherence latency is exposed on the critical path.
```
false-sharing/
├── bench.sh
├── Makefile
├── README.md
├── results/
│   ├── notes.md
│   └── sample_output.txt
└── src/
    ├── common.h
    ├── store-only/
    │   ├── false_sharing.c
    │   └── padded_no_sharing.c
    └── read-modify-write/
        ├── false_sharing.c
        └── padded_no_sharing.c
```
The directory structure mirrors the experimental matrix exactly.
All four programs share a single header that defines:
- `ITERATIONS` — total loop count per thread
- `NTHREADS` — number of worker threads
- `CACHELINE_SIZE` — assumed cache-line width (64 bytes)
- `pin_thread(cpu)` — pins each thread to a specific CPU
- `now_ns()` — high-resolution monotonic timing
- `compiler_barrier()` — prevents loop hoisting or reordering
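A minimal sketch of what such a header can look like, assuming Linux and glibc (`pthread_setaffinity_np` is a non-portable extension). This is an illustration, not the repository's actual `common.h`, and the constants are placeholders:

```c
/* common.h -- shared scaffolding (illustrative sketch) */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <time.h>

#define ITERATIONS (1ULL << 30)   /* total loop count per thread (placeholder) */
#define NTHREADS 2
#define CACHELINE_SIZE 64         /* assumed cache-line width */

/* Pin the calling thread to one CPU (Linux-specific). */
static inline void pin_thread(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* High-resolution monotonic timestamp in nanoseconds. */
static inline uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Keep the compiler from hoisting or eliding loop memory accesses. */
static inline void compiler_barrier(void) {
    __asm__ __volatile__("" ::: "memory");
}
```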
This guarantees:
- identical thread placement
- identical timing logic
- identical progress reporting
Any observed difference is therefore architectural, not accidental.
Each benchmark:
- Pins two threads to separate CPUs
- Runs a tight loop for a fixed number of iterations
- Updates a thread-local pointer on each iteration
- Periodically updates a progress counter for visibility
- Prints elapsed time and throughput
No syscalls in the hot path. No locks. No atomics in the loop.
In the store-only benchmarks, the hot loop looks like:
```c
(*local)++;
```

On modern CPUs this often compiles to:
- a buffered store
- retired without waiting for coherence
This means false sharing can exist without stalling the pipeline.
As a result:
- false sharing may appear cheap
- throughput differences may look small
This is not because false sharing is harmless — it is because its cost is hidden.
In the read-modify-write benchmarks, the loop becomes:
```c
uint64_t v = *local;
v++;
*local = v;
```

Now the CPU must:
- Obtain exclusive ownership of the cache line
- Wait for invalidations on other cores
- Complete the load before retiring the store
Coherence latency is now unavoidable.
This is where false sharing becomes visibly expensive.
You should observe something like:
| Case | Relative speed |
|---|---|
| store-only + padded | fastest |
| store-only + false sharing | slightly slower |
| RMW + padded | slower |
| RMW + false sharing | much slower |
Exact numbers depend on:
- CPU model
- cache hierarchy
- interconnect
The ordering should remain stable.
Without pinning:
- the OS may migrate threads
- cache ownership becomes unstable
- results become noisy
Pinning ensures:
- each cache line has a consistent owner
- coherence traffic is real, not incidental
This is essential for a teaching benchmark.
Each benchmark prints elapsed time periodically to stderr.
This serves two purposes:
- confirms the program is still running
- makes long-running behavior visible in asciinema recordings
Progress updates are deliberately rare to avoid perturbing the measurement.
All commands are run from the repository root.
```sh
make
```

This builds all four binaries using identical flags.
This is the most explicit and transparent way to build the benchmarks.
```sh
clang -O2 -pthread src/store-only/false_sharing.c \
    -o store_false_sharing
clang -O2 -pthread src/store-only/padded_no_sharing.c \
    -o store_padded_no_sharing
clang -O2 -pthread src/read-modify-write/false_sharing.c \
    -o rmw_false_sharing
clang -O2 -pthread src/read-modify-write/padded_no_sharing.c \
    -o rmw_padded_no_sharing
```

Run each binary directly:

```sh
./store_false_sharing
./store_padded_no_sharing
./rmw_false_sharing
./rmw_padded_no_sharing
```

Each program prints elapsed time and throughput on completion, while periodically reporting progress to stderr during execution.
This repository also includes a dedicated benchmark script, `bench.sh`, that automates the 2×2 matrix execution and captures timing data consistently.

First, build all variants with the Makefile, which sets the correct include paths and optimization levels:

```sh
make
```

Then run the automated benchmark. The script executes each binary and extracts the wall-clock (`real`) time for direct comparison:

```sh
chmod +x bench.sh
./bench.sh
```
## Why this benchmark is intentionally small
Large benchmarks obscure causality.
This project avoids:
* abstraction layers
* helper libraries
* configurable knobs
So that when performance changes, **there is only one place to look**.
---
## Takeaway
False sharing is not binary.
It can be:
* present but hidden
* present and catastrophic
Whether you *see* it depends on whether your code forces coherence latency onto the critical path.
This repository exists to make that distinction impossible to ignore.
---
## License
MIT. Hack freely.