Emmanuel326/false-sharing

False Sharing, Made Obvious

A minimal, controlled set of microbenchmarks that make CPU cache-line sharing visibly hurt.

This repository exists to answer a deceptively simple question:

Why does false sharing sometimes look harmless — and sometimes catastrophic?

The short answer is: it depends on whether coherence latency is on the critical path.

This project demonstrates that fact with four tiny programs, arranged along two orthogonal axes, and nothing else.


What this project is (and is not)

This is:

  • A controlled experiment
  • A teaching benchmark
  • A coherence-mechanics demo
  • Reproducible on commodity hardware

This is not:

  • A general performance benchmark
  • A realistic workload
  • A cache simulator
  • A library or framework

Every line exists to isolate a single effect.


The two axes of the experiment

The benchmarks are organized as a 2×2 matrix.

Axis 1 — Cache-line placement

| Case | Description |
|------|-------------|
| `false_sharing` | Two threads update different variables that reside on the same cache line |
| `padded_no_sharing` | Each thread updates a variable on its own cache line |

This isolates false sharing vs no sharing.


Axis 2 — Operation semantics

| Case | Description |
|------|-------------|
| store-only | Each iteration performs a single store (latency may be hidden) |
| read-modify-write | Each iteration performs an explicit load → modify → store |

This isolates whether coherence latency is exposed to the critical path.


Directory layout

false-sharing/
├── bench.sh
├── Makefile
├── README.md
├── results/
│   ├── notes.md
│   └── sample_output.txt
└── src/
    ├── common.h
    ├── store-only/
    │   ├── false_sharing.c
    │   └── padded_no_sharing.c
    └── read-modify-write/
        ├── false_sharing.c
        └── padded_no_sharing.c

The directory structure mirrors the experimental matrix exactly.


What common.h provides

All four programs share a single header that defines:

  • ITERATIONS — total loop count per thread
  • NTHREADS — number of worker threads
  • CACHELINE_SIZE — assumed cache-line width (64 bytes)
  • pin_thread(cpu) — pins each thread to a specific CPU
  • now_ns() — high-resolution monotonic timing
  • compiler_barrier() — prevents loop hoisting or reordering

This guarantees:

  • identical thread placement
  • identical timing logic
  • identical progress reporting

Any observed difference is therefore architectural, not accidental.


What the code actually does

Each benchmark:

  1. Pins two threads to separate CPUs
  2. Runs a tight loop for a fixed number of iterations
  3. Updates a counter through a thread-local pointer on each iteration
  4. Periodically updates a progress counter for visibility
  5. Prints elapsed time and throughput

No syscalls in the hot path. No locks. No atomics in the loop.


Why store-only sometimes lies

In the store-only benchmarks, the hot loop looks like:

(*local)++;

On modern CPUs this compiles to a single store instruction, which:

  • is absorbed by the store buffer
  • retires without waiting for coherence to complete

This means false sharing can exist without stalling the pipeline.

As a result:

  • false sharing may appear cheap
  • throughput differences may look small

This is not because false sharing is harmless — it is because its cost is hidden.


Why read–modify–write tells the truth

In the read-modify-write benchmarks, the loop becomes:

uint64_t v = *local;
v++;
*local = v;

Now the CPU must:

  1. Obtain exclusive ownership of the cache line
  2. Wait for invalidations on other cores
  3. Complete the load before retiring the store

Coherence latency is now unavoidable.

This is where false sharing becomes visibly expensive.


Expected qualitative results

You should observe something like:

| Case | Relative speed |
|------|----------------|
| store-only + padded | fastest |
| store-only + false sharing | slightly slower |
| RMW + padded | slower |
| RMW + false sharing | much slower |

Exact numbers depend on:

  • CPU model
  • cache hierarchy
  • interconnect

The ordering should remain stable.


Why threads are pinned

Without pinning:

  • the OS may migrate threads
  • cache ownership becomes unstable
  • results become noisy

Pinning ensures:

  • each cache line has a consistent owner
  • coherence traffic is real, not incidental

This is essential for a teaching benchmark.


Why progress printing exists

Each benchmark prints elapsed time periodically to stderr.

This serves two purposes:

  • confirms the program is still running
  • makes long-running behavior visible in asciinema recordings

Progress updates are deliberately rare to avoid perturbing the measurement.


How to build and run

All commands are run from the repository root.

Option 1 — Build everything via Make

make

This builds all four binaries using identical flags.


Option 2 — Build each benchmark explicitly

This is the most explicit and transparent way to build the benchmarks.

Store-only / false sharing

clang -O2 -pthread -I src src/store-only/false_sharing.c \
    -o store_false_sharing

Store-only / no sharing (padded)

clang -O2 -pthread -I src src/store-only/padded_no_sharing.c \
    -o store_padded_no_sharing

Read–modify–write / false sharing

clang -O2 -pthread -I src src/read-modify-write/false_sharing.c \
    -o rmw_false_sharing

Read–modify–write / no sharing (padded)

clang -O2 -pthread -I src src/read-modify-write/padded_no_sharing.c \
    -o rmw_padded_no_sharing

Running the benchmarks

Run each binary directly:

./store_false_sharing
./store_padded_no_sharing
./rmw_false_sharing
./rmw_padded_no_sharing

Each program prints elapsed time and throughput on completion, while periodically reporting progress to stderr during execution.



Automation and Benchmarking

This repository includes a dedicated benchmarking suite to automate the 2×2 matrix execution and capture timing data consistently.

1. Build the Suite

Use the provided Makefile to compile all variants with the correct include paths and optimization levels:

make

2. Run the Automated Benchmark

The bench.sh script executes each binary and extracts the real (wall-clock) time for direct comparison:

chmod +x bench.sh
./bench.sh



Why this benchmark is intentionally small

Large benchmarks obscure causality.

This project avoids:

  • abstraction layers
  • helper libraries
  • configurable knobs

So that when performance changes, there is only one place to look.


Takeaway

False sharing is not binary.

It can be:

  • present but hidden
  • present and catastrophic

Whether you see it depends on whether your code forces coherence latency onto the critical path.

This repository exists to make that distinction impossible to ignore.


License

MIT. Hack freely.
