Merged
@@ -1,39 +1,45 @@
---
title: Introduction to C++ memory models
title: Introduction to C++ Memory Models
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## What is a memory model?
## What is a Memory Model?

A language’s memory model defines how operations on shared data interleave at runtime, providing rules on what reorderings are allowed by compilers and hardware. In C++, the memory model specifies how threads interact with shared variables, ensuring consistent behavior across different compilers and architectures. You can think of memory ordering in 4 broad categories.
A programming language’s memory model defines how operations on shared data can interleave at runtime. It sets rules for how compilers and hardware might reorder these operations.

- **Source Code Order**: The exact sequence in which you write statements. This is the most intuitive view because it directly reflects how code appears to you.
In C++, the memory model specifically defines how threads interact with shared variables, ensuring consistent behavior across different compilers and architectures.

You can think of memory ordering as falling into four broad categories:

1. **Source Code Order** - the exact sequence in which you write statements. This is the most intuitive view because it directly reflects how code appears to you.

Here is an example:

```output
int x = 5; // A
int z = x * 5 // B
int y = 42 // C
int z = x * 5; // B
int y = 42; // C
```

- **Program Order**: The logical sequence recognized by the compiler, which may rearrange or optimize instructions under certain constraints to create a program that takes fewer cycles. Although the statements may appear in a particular order in your source code, the compiler could restructure them if it deems it safe. For example, the pseudo assembly below reorders the source line instructions above.
2. **Program Order** - the logical sequence the compiler recognizes. The compiler might rearrange or optimize instructions, under certain constraints, to produce a program that executes in fewer cycles. Although your source code lists statements in a particular order, the compiler can restructure them if it deems it safe. For example, the pseudo-assembly below reorders the source instructions:

```output
LDR R1 #5 // A
LDR R2 #42 // C
MUL R3, R1, #5 // B
```

- **Execution Order**: How instructions are actually issued and executed by the hardware. Modern CPUs often employ techniques to improve instruction-level parallelism such as out-of-order execution and speculation for performance. For instance, on an Arm-based system, you might see instructions issued in different order during runtime. The subtle difference between program order and execution order is that program order refers to the sequence seen in the binary whereas execution is the order in which those instructions are actually issued and retired. Even though the instructions are listed in one order, the CPU might reorder their micro-operations as long as it respects dependencies.
3. **Execution Order** - the order in which the hardware actually issues and executes instructions. Modern CPUs often employ techniques such as out-of-order execution and speculation to improve instruction-level parallelism. For instance, on an Arm-based system, you might see instructions issued in a different order at runtime. The subtle difference between program order and execution order is that program order refers to the sequence seen in the binary, whereas execution order is the order in which those instructions are actually issued and retired. Even though the instructions are listed in one order, the CPU might reorder their micro-operations as long as it respects dependencies.

- **Hardware Perceived Order**: This is the perspective observed by other devices in the system, which can differ if the hardware buffers writes or merges memory operations. Crucially, the hardware-perceived order can vary between CPU architectures, for example between x86 and Arm, and this should be considered when porting applications. An abstract diagram from the academic paper is shown below [Maranget et. al, 2012]. A write operation in one of the 5 threads in the pentagon below may propagate to the other threads in any order.
4. **Hardware Perceived Order** - this is the perspective observed by other devices in the system, which can differ if the hardware buffers writes or merges memory operations. Crucially, the hardware-perceived order can vary between CPU architectures, for example between x86 and Arm, and this should be considered when porting applications.

![abstract_model](./multi-copy-atomic.png)
## High-level differences between the Arm Memory Model and the x86 Memory Model

## High-level differences between the Arm memory model and the x86 memory model
The memory models of Arm and x86 architectures differ in terms of ordering guarantees and required synchronizations.

The memory models of Arm and x86 architectures differ in terms of ordering guarantees and required synchronizations. x86 processors implement a relatively strong memory model, commonly referred to as Total Store Order (TSO). Under TSO, loads and stores appear to execute in program order, with only limited reordering permitted. This strong ordering means that software running on x86 generally relies on fewer memory barrier instructions, making it easier to reason about concurrency.
x86 processors implement a relatively strong memory model, commonly referred to as Total Store Order (TSO). Under TSO, loads and stores appear to execute in program order, with only limited reordering permitted. This strong ordering means that software running on x86 generally relies on fewer memory barrier instructions, making it easier to reason about concurrency.

In contrast, Arm’s memory model is more relaxed, allowing greater reordering of memory operations to optimize performance and energy efficiency. This relaxed model provides less intuitive ordering guarantees, meaning that loads and stores may be observed out of order by other processors. This means that source code needs to correctly follow the language standard to ensure reliable behavior.
In contrast, Arm’s memory model is more relaxed, allowing greater reordering of memory operations to optimize performance and energy efficiency. This relaxed model provides less intuitive ordering guarantees, meaning that loads and stores can be observed out of order by other processors. This means that source code needs to correctly follow the language standard to ensure reliable behavior.
@@ -1,5 +1,5 @@
---
title: The C++ memory model and atomics
title: The C++ Memory Model and Atomics
weight: 3

### FIXED, DO NOT MODIFY
@@ -8,9 +8,9 @@ layout: learningpathall

## The C++ memory model for single threads

For a long time, writing C++ programs on single-core systems was relatively straightforward. The compiler could reorder instructions however it wished, so long as the program’s observable behavior remained unchanged. This optimization freedom is commonly referred to as the “as-if” rule. Essentially, compilers can optimize away or move instructions around as if the code had not changed, provided they do not affect inputs, outputs, or volatile accesses.
For a long time, writing C++ programs on single-core systems was straightforward. Compilers could reorder instructions freely, as long as the program’s observable behavior remained unchanged. This flexibility is commonly referred to as the “as-if” rule. Essentially, compilers could optimize away or move instructions around as if the code had not changed, provided the changes did not affect inputs, outputs, or volatile memory accesses.

The single-threaded world was simpler: you wrote code, the compiler made it faster (by safely reordering or eliminating instructions), and performance benefited. Over time, multi-core processors and multi-threaded applications became the norm. Suddenly, reordering instructions was not only about performance because it could change the meaning of programs with threads reading and writing shared data simultaneously.
The single-threaded world was simpler: you wrote code, the compiler safely reordered or eliminated instructions to make it faster, and your program performed better. But as multi-core processors and multi-threaded applications became common, instruction reordering was not only about improving performance - it could actually change the meaning of programs, especially when multiple threads accessed shared data simultaneously.

### Expanding the memory model for multiple threads

@@ -1,5 +1,5 @@
---
title: Race condition example
title: Walk Through a Race Condition Example
weight: 4

### FIXED, DO NOT MODIFY
@@ -8,31 +8,35 @@ layout: learningpathall

## Example of a race condition when porting from x86 to Arm

Due to the differences in the hardware perceived ordering as explained in the earlier sections, source code written for x86 may behave differently when ported to Arm. To demonstrate this we will create a trivial example and run it both on an x86 and Arm cloud instance.
Due to the differences in the hardware memory ordering, as explained in the earlier sections, source code written for x86 can behave differently when ported to Arm.

Start an Arm-based cloud instance. This example uses a `t4g.xlarge` AWS instance running Ubuntu 22.04 LTS, but other instances types are possible.
To demonstrate this, the Learning Path walks you through a simple example that runs on both an x86 and an Arm cloud instance.

If you are new to cloud-based virtual machines, refer to [Get started with Servers and Cloud Computing](/learning-paths/servers-and-cloud-computing/intro/).
### Get Started

Start an Arm-based cloud instance. This example uses a `t4g.xlarge` AWS instance running Ubuntu 22.04 LTS, but you can use other instance types.

If you are new to cloud-based virtual machines, see [Get started with Servers and Cloud Computing](/learning-paths/servers-and-cloud-computing/intro/).

First, confirm you are using an Arm-based instance with the following command:

```bash
uname -m
```
You should see the following output.
You should see the following output:

```output
aarch64
```

Next, install the required software packages.
Next, install the required software packages:

```bash
sudo apt update
sudo apt install g++ clang -y
```

Use a text editor to copy and paste the following code snippet into a file named `relaxed_memory_ordering.cpp`.
Use a text editor to copy and paste the following code snippet into a file named `relaxed_memory_ordering.cpp`:

```cpp
#include <iostream>
@@ -85,31 +89,31 @@ int main() {
}
```

The code above is a small example of a data race condition. Thread A creates a node variable and assigns it the number 42. Thread B checks that the variable assigned to the Node is equal to 42. Both functions use the `memory_order_relaxed` model, which allows the possibility for thread B to read an uninitialized variable before it has been assigned the value 42 in thread A.
The code above demonstrates a data race condition. Thread A creates a node variable and assigns it the value `42`. Thread B checks that the variable assigned to the Node equals `42`. Both threads use the `memory_order_relaxed` model, which allows thread B to read an uninitialized variable before thread A assigns the value `42`.

Compile the program using the GNU compiler:

```bash
g++ relaxed_memory_ordering.cpp -o relaxed_memory_ordering -O3
```

Run the command below to run the binary 10 times. Multiple runs increases the chance of observing a race condition.
Run the binary 10 times to increase the chance of observing a race condition:

```bash
for i in {1..10}; do ./relaxed_memory_ordering; done;
```

If you do not see a race condition, the animation below shows a race condition being triggered on the 3rd run.
If you do not see a race condition, the animation below shows a race condition being triggered on the third run:

![Arm64-race-cond](./aarch64-race-condition.gif)

As the graphic above illustrates, a race condition is not a guarantee but a probability.
As the graphic above illustrates, a race condition is probabilistic, and not guaranteed.

Unfortunately, in production workloads there may be a more subtle probability that may surface under specific workloads. This is the reason race conditions are difficult to spot.
Subtle issues can surface under specific workloads, making them challenging to detect.

### Behavior on an x86 instance

Due to the more strong memory model associated with x86 processors, programs that do not adhere to the C++ standard may give programmers a false sense of security. To demonstrate this, create an connect to an AWS `t2.2xlarge` instance that uses the x86 architecture.
Due to the stronger memory model in x86 processors, programs that do not adhere to the C++ standard might give programmers a false sense of security. To demonstrate this, create and connect to an AWS `t2.2xlarge` instance that uses the x86 architecture.

Run the following command to confirm that the underlying hardware is an Intel Xeon E5-2686 processor:

@@ -8,18 +8,20 @@ layout: learningpathall

## How can I detect infrequent race conditions?

ThreadSanitizer, commonly referred to as `TSan`, is a concurrency bug detection tool that identifies data races in multi-threaded programs. By instrumenting code at compile time, TSan dynamically tracks memory operations, monitoring lock usage and detecting inconsistencies in thread synchronization. When it finds a potential data race, it reports detailed information to aid debugging. TSan's overhead can be significant, but it provides valuable insights into concurrency issues often missed by static analysis.
ThreadSanitizer (TSan) is a concurrency bug detection tool that identifies data races in multithreaded programs. By instrumenting code at compile time, `TSan` dynamically tracks memory operations, monitors lock usage, and detects inconsistencies in thread synchronization. When a potential data race is found, `TSan` provides detailed reports to help you debug.

TSan is available through both recent `clang` and `gcc` compilers.
Although its runtime overhead can be significant, `TSan` provides valuable insights into concurrency issues often missed by static analysis tools.

Use the `clang++` compiler to compile the example and run the executable:
`TSan` is available in recent versions of the `clang` and `gcc` compilers.

Compile and run the following example using the `clang++` compiler:

```bash
clang++ relaxed_memory_ordering.cpp -o relaxed_memory_ordering -fsanitize=thread -fPIE -pie -g
./relaxed_memory_ordering
```

The output is similar to:
The output will look similar to:

```output
==================
@@ -32,16 +34,16 @@ SUMMARY: ThreadSanitizer: data race /home/ubuntu/src/relaxed_memory_ordering.cpp
==================
```

The output highlights a potential data race in the `threadB` function corresponding to the source code expression `n->x != 42`.
This output highlights a potential data race in the `threadB` function, corresponding to the source code expression `n->x != 42`.

## Does TSan have any limitations?

Thread Sanitizer (TSan) is powerful for detecting data races but has notable drawbacks.
While powerful, `TSan` has some notable drawbacks:

First, it only identifies concurrency issues at runtime, meaning any problematic code that isn’t exercised during testing goes unnoticed.
* It identifies concurrency issues only at runtime, meaning code paths not exercised during testing remain unchecked.

Second, if race conditions exist in third-party binaries or libraries, TSan can’t instrument or fix them without access to their source code.
* It cannot instrument or fix race conditions in third-party binaries or libraries without source code access.

Another major limitation is performance overhead: TSan can slow programs by 2 to 20x and requires extra memory, making it challenging for large-scale or real-time systems.
* It introduces significant performance overhead, typically slowing programs by 2 to 20 times and requiring additional memory. This makes it challenging to use in large-scale or real-time systems.

For further information please refer to the [ThreadSanitizer documentation](https://github.com/google/sanitizers/wiki/threadsanitizercppmanual).
For further information, see the [ThreadSanitizer documentation](https://github.com/google/sanitizers/wiki/threadsanitizercppmanual).
@@ -10,9 +10,9 @@ minutes_to_complete: 45
who_is_this_for: This is an advanced topic for C++ developers porting applications from x86 to Arm and optimizing performance.

learning_objectives:
- Learn about the C++ memory model.
- Learn about the differences between the Arm and x86 memory model.
- Learn best practices for writing C++ on Arm to avoid race conditions.
- Describe at a high level what a memory model does, and the types of memory ordering.
- Describe the differences between the Arm and x86 memory model.
- Employ best practices for writing C++ on Arm to avoid race conditions.

prerequisites:
- Access to an x86 and Arm cloud instance (virtual machine).
@@ -27,19 +27,19 @@ armips:
- Neoverse
tools_software_languages:
- C++
- ThreadSanitizer (TSan)
- TSan
- Runbook
operatingsystems:
- Linux
- Runbook


further_reading:
- resource:
title: C++ Memory Order Reference Manual
link: https://en.cppreference.com/w/cpp/atomic/memory_order
type: documentation
- resource:
title: Thread Sanitizer Manual
link: Phttps://github.com/google/sanitizers/wiki/threadsanitizercppmanual
link: https://github.com/google/sanitizers/wiki/threadsanitizercppmanual
type: documentation

