A minimal, controlled set of microbenchmarks that make CPU cache-line sharing visibly hurt.
This repository exists to answer a deceptively simple question:
Why does false sharing sometimes look harmless — and sometimes catastrophic?
The short answer is: it depends on whether coherence latency is on the critical path.
This project demonstrates that fact with four tiny programs, arranged along two orthogonal axes, and nothing else.
This is:
- A controlled experiment
- A teaching benchmark
- A coherence-mechanics demo
- Reproducible on commodity hardware
This is not:
- A general performance benchmark
- A realistic workload
- A cache simulator
- A library or framework
Every line exists to isolate a single effect.
The benchmarks are organized as a 2×2 matrix.
| Case | Description |
|---|---|
| `false_sharing` | Two threads update different variables that reside on the same cache line |
| `padded_no_sharing` | Each thread updates a variable on its own cache line |

This isolates false sharing vs. no sharing.
| Case | Description |
|---|---|
| `store-only` | Each iteration performs a single store (latency may be hidden) |
| `read-modify-write` | Each iteration performs an explicit load → modify → store |

This isolates whether coherence latency is exposed on the critical path.
```
false-sharing/
├── bench.sh
├── Makefile
├── README.md
├── results/
│   ├── notes.md
│   └── sample_output.txt
└── src/
    ├── common.h
    ├── store-only/
    │   ├── false_sharing.c
    │   └── padded_no_sharing.c
    └── read-modify-write/
        ├── false_sharing.c
        └── padded_no_sharing.c
```
The directory structure mirrors the experimental matrix exactly.
All four programs share a single header that defines:
- `ITERATIONS` — total loop count per thread
- `NTHREADS` — number of worker threads
- `CACHELINE_SIZE` — assumed cache-line width (64 bytes)
- `pin_thread(cpu)` — pins each thread to a specific CPU
- `now_ns()` — high-resolution monotonic timing
- `compiler_barrier()` — prevents loop hoisting or reordering
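A minimal sketch of what such a header can look like, assuming Linux and glibc (`pthread_setaffinity_np` is a non-portable extension). This is an illustration, not the repository's actual `common.h`, and the constants are placeholders:

```c
/* common.h -- shared scaffolding (illustrative sketch) */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <time.h>

#define ITERATIONS (1ULL << 30)   /* total loop count per thread (placeholder) */
#define NTHREADS 2
#define CACHELINE_SIZE 64         /* assumed cache-line width */

/* Pin the calling thread to one CPU (Linux-specific). */
static inline void pin_thread(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* High-resolution monotonic timestamp in nanoseconds. */
static inline uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Keep the compiler from hoisting or eliding loop memory accesses. */
static inline void compiler_barrier(void) {
    __asm__ __volatile__("" ::: "memory");
}
```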
This guarantees:
- identical thread placement
- identical timing logic
- identical progress reporting
Any observed difference is therefore architectural, not accidental.
Each benchmark:
- Pins two threads to separate CPUs
- Runs a tight loop for a fixed number of iterations
- Updates a thread-local pointer on each iteration
- Periodically updates a progress counter for visibility
- Prints elapsed time and throughput
No syscalls in the hot path. No locks. No atomics in the loop.
In the store-only benchmarks, the hot loop looks like:
```c
(*local)++;
```

On modern CPUs this often compiles to:
- a buffered store
- retired without waiting for coherence
This means false sharing can exist without stalling the pipeline.
As a result:
- false sharing may appear cheap
- throughput differences may look small
This is not because false sharing is harmless — it is because its cost is hidden.
In the read-modify-write benchmarks, the loop becomes:
```c
uint64_t v = *local;
v++;
*local = v;
```

Now the CPU must:
- Obtain exclusive ownership of the cache line
- Wait for invalidations on other cores
- Complete the load before retiring the store
Coherence latency is now unavoidable.
This is where false sharing becomes visibly expensive.
You should observe something like:
| Case | Relative speed |
|---|---|
| store-only + padded | fastest |
| store-only + false sharing | slightly slower |
| RMW + padded | slower |
| RMW + false sharing | much slower |
Exact numbers depend on:
- CPU model
- cache hierarchy
- interconnect
The ordering should remain stable.
Without pinning:
- the OS may migrate threads
- cache ownership becomes unstable
- results become noisy
Pinning ensures:
- each cache line has a consistent owner
- coherence traffic is real, not incidental
This is essential for a teaching benchmark.
Each benchmark prints elapsed time periodically to stderr.
This serves two purposes:
- confirms the program is still running
- makes long-running behavior visible in asciinema recordings
Progress updates are deliberately rare to avoid perturbing the measurement.
All commands are run from the repository root.
```sh
make
```

This builds all four binaries using identical flags.
This is the most explicit and transparent way to build the benchmarks.
```sh
clang -O2 -pthread src/store-only/false_sharing.c \
    -o store_false_sharing
clang -O2 -pthread src/store-only/padded_no_sharing.c \
    -o store_padded_no_sharing
clang -O2 -pthread src/read-modify-write/false_sharing.c \
    -o rmw_false_sharing
clang -O2 -pthread src/read-modify-write/padded_no_sharing.c \
    -o rmw_padded_no_sharing
```

Run each binary directly:

```sh
./store_false_sharing
./store_padded_no_sharing
./rmw_false_sharing
./rmw_padded_no_sharing
```

Each program prints elapsed time and throughput on completion, while periodically reporting progress to stderr during execution.
This repository also includes a dedicated benchmark script, `bench.sh`, that automates the 2×2 matrix execution and captures timing data consistently.

First, build all variants with the Makefile, which sets the correct include paths and optimization levels:

```sh
make
```

Then run the automated benchmark. The script executes each binary and extracts the wall-clock (`real`) time for direct comparison:

```sh
chmod +x bench.sh
./bench.sh
```
## Why this benchmark is intentionally small
Large benchmarks obscure causality.
This project avoids:
* abstraction layers
* helper libraries
* configurable knobs
So that when performance changes, **there is only one place to look**.
---
## Takeaway
False sharing is not binary.
It can be:
* present but hidden
* present and catastrophic
Whether you *see* it depends on whether your code forces coherence latency onto the critical path.
This repository exists to make that distinction impossible to ignore.
---
## License
MIT. Hack freely.