diff --git a/content/learning-paths/servers-and-cloud-computing/memory_consistency/_index.md b/content/learning-paths/servers-and-cloud-computing/memory_consistency/_index.md
index 01555c94c7..8ff4e47872 100644
--- a/content/learning-paths/servers-and-cloud-computing/memory_consistency/_index.md
+++ b/content/learning-paths/servers-and-cloud-computing/memory_consistency/_index.md
@@ -29,13 +29,6 @@ tools_software_languages:
    - Litmus7
    - Arm ISA

-test_images:
-    - ubuntu:latest
-test_link: null
-test_maintenance: false
-test_status:
-    - passed
-
further_reading:
    - resource:
        title: Arm Architecture Reference Manual for A-profile architecture
diff --git a/content/learning-paths/servers-and-cloud-computing/memory_consistency/arm_mem_model.md b/content/learning-paths/servers-and-cloud-computing/memory_consistency/arm_mem_model.md
index fdb1d829c7..a8129b6264 100644
--- a/content/learning-paths/servers-and-cloud-computing/memory_consistency/arm_mem_model.md
+++ b/content/learning-paths/servers-and-cloud-computing/memory_consistency/arm_mem_model.md
@@ -6,40 +6,69 @@ layout: "learningpathall"

## CPU Memory Model vs Language/Runtime Memory Models

-It's important to understand that the majority of developers will not need to concern themselves with the memory consistency model of the CPU their code will execute on. This is because programming languages and runtime engines abstract away the CPUs memory model by presenting the programmer with a language/runtime memory model. This abstraction is achieved by providing developers with a set of language/runtime specific memory ordering rules, synchronization constructs, and supporting libraries. So long as the developer uses these correctly, language compilers and runtime engines will make sure the code executes correctly on any CPU regardless of how strong or weakly ordered it is.
+The majority of developers will not need to be deeply familiar with the memory consistency model of the CPU their code will execute on. This is because programming languages and runtime engines abstract away the CPU's memory model by presenting the programmer with a language/runtime memory model. This abstraction is achieved by providing developers with a set of language/runtime-specific memory ordering rules, synchronization constructs, and supporting libraries. As long as the developer uses these correctly, language compilers and runtime engines will make sure the code executes correctly on any CPU, regardless of how strongly or weakly ordered it is.

That said, developers may want to dig deeper into this topic for various reasons including:

-- Extract more performance from weakly ordered CPUs (like Arm).
-  - Keep in mind that compilers and runtimes will do a good job of maximizing performance. Only in well understood niche cases will there be a potential for performance gain by going beyond what the compiler/runtime would do to higher level code. For most cases, all it takes to get more performance is to use the latest compilers, compiler switches, and runtimes.
+- Extract more performance from weakly ordered CPUs (like Arm CPUs).
+  - Compilers and runtimes will do a good job of maximizing performance. Only in well-understood niche cases will there be a potential for performance gain by going beyond what the compiler/runtime would do with higher-level code. For most cases, all it takes to get more performance is to use the latest compilers, compiler switches, and runtimes.
- Develop confidence in the correctness of synchronization constructs.
- Develop an understanding of how compilers and runtimes select different machine instructions while still honoring the memory ordering rules of the language/runtime and CPU.
-- General learning.

-## What We Will Do
-
-In this Learning Path, we will use publicly available tools to explore thread synchronization on Arm CPUs. This will provide the reader with enough working knowledge of the tools to be able to explore on their own. At the end of this Learning Path, we will provide more information on where readers can find a more formal (and complex) treatment of this subject.
+In this Learning Path, you will use publicly available tools to explore thread synchronization on Arm CPUs. You will gain enough working knowledge of the tools to continue exploring these concepts on your own. At the end of this Learning Path, you will find additional resources that provide a deeper and more formal treatment of this subject.

## The Formal Definition of the Arm Memory Model

-The formal definition of the Arm memory model is in the [Arm Architecture Reference Manual for A-profile architecture](https://developer.arm.com/documentation/ddi0487/la) (Arm ARM) under the section called `The AArch64 Application Level Memory Model`. The ordering requirements defined in the Arm ARM is a transliteration of the `aarch64.cat` file hosted on the Arm [herd7 simulator tool](https://developer.arm.com/herd7). In fact, this `aarch64.cat` file is the authoritative definition of the Arm memory model. Since it is a formal definition, it is complex.
+The formal definition of the Arm memory model is in the [Arm Architecture Reference Manual for A-profile architecture](https://developer.arm.com/documentation/ddi0487/la) (Arm ARM) under the section called `The AArch64 Application Level Memory Model`. The ordering requirements defined in the Arm ARM are a transliteration of the `aarch64.cat` file hosted on the Arm [herd7 simulator tool](https://developer.arm.com/herd7). In fact, this `aarch64.cat` file is the authoritative definition of the Arm memory model. As it is a formal definition, it is complex.

## Herd7 Simulator & Litmus7 Tool

-The herd7 simulator provides a way to test snippets of Arm assembly against the formal definition of the Arm memory model (the `aarch64.cat` file mentioned above). The litmus7 tool can take the same snippets of assembly that run on herd7 and run them on actual Arm HW. This allows for comparing the formal memory model to the memory model of an actual Arm CPU. These snippets of assembly are called litmus tests.
+The herd7 simulator provides a way to test snippets of Arm assembly against the formal definition of the Arm memory model (the `aarch64.cat` file mentioned above). The litmus7 tool can take the same snippets of assembly that run on herd7 and run them on actual Arm hardware. This allows for comparing the formal memory model to the memory model of an actual Arm CPU. These snippets of assembly are called litmus tests.

It's important to understand that it is possible for an implementation of an Arm CPU to be more strongly ordered than the formally defined Arm memory model. This case is not a violation of the memory model because it will still execute code in a way that is compliant with the memory ordering rules.

-## Installing the Tools
+## Install the Tools
+
+Herd7 and Litmus7 are part of the [diy7](http://diy.inria.fr/) tool suite. The diy7 tool suite can be installed by following the [installation instructions](http://diy.inria.fr/sources/index.html).
You will install the tools on an Arm Linux system so that both herd7 and litmus7 can be compared side by side on the same system.
+
+Start an Arm-based cloud instance. This example uses a `t4g.xlarge` AWS instance running Ubuntu 22.04 LTS, but other instance types are possible.
+
+If you are new to cloud-based virtual machines, refer to [Get started with Servers and Cloud Computing](/learning-paths/servers-and-cloud-computing/intro/).
+
+First, confirm that you are using an Arm-based instance with the following command:
+
+```bash
+uname -m
+```
+You should see the following output.
+
+```output
+aarch64
+```
+
+Next, install the OCaml Package Manager (opam). You will need `opam` to install the `herdtools7` tool suite:
+```bash
+sudo apt update
+sudo apt install opam -y
+```
+
+Set up `opam` to install the tools:
+```bash
+opam init
+opam update
+eval $(opam env)
+```

-Herd7 and Litmus7 are part of the [diy7](http://diy.inria.fr/) tool suite. The diy7 tool suite can be installed by following the [installation instructions](http://diy.inria.fr/sources/index.html). We suggest installing this on an Arm system so that both herd7 and litmus7 can be compared side by side on the same system.
+Now install the `herdtools7` tool suite, which includes both `litmus7` and `herd7`:
+```bash
+opam install herdtools7
+```

-## Running Herd7 and Litmus7 Example Commands
-The test file is assumed to be called `test.litmus` and is in the current directory.
+## Herd7 and Litmus7 Example Commands

-The help menu shows all options.
+You can run `--help` on both tools to review all the available options.
```
herd7 --help
```
@@ -47,6 +76,10 @@ herd7 --help
litmus7 --help
```

+Both the `herd7` and `litmus7` tools take snippets of assembly code, called litmus tests, as input.
+
+Shown below are some examples of running the tools with a litmus test. In the next section, you will go through an actual litmus test example.
+
Example of running herd7.
```
herd7 ./test.litmus
diff --git a/content/learning-paths/servers-and-cloud-computing/memory_consistency/examples.md b/content/learning-paths/servers-and-cloud-computing/memory_consistency/examples.md
index 4919fc6cee..6cd87bc43b 100644
--- a/content/learning-paths/servers-and-cloud-computing/memory_consistency/examples.md
+++ b/content/learning-paths/servers-and-cloud-computing/memory_consistency/examples.md
@@ -10,15 +10,16 @@ layout: "learningpathall"

## Arm Instructions with Acquire-Release Ordering

-Most of the examples below use the instructions `LDAR` and `STLR` which are load-acquire and store-release respectively. However, there are other instructions that support acquire-release ordering. More specifically, the various atomic instructions that were made mandatory as of Armv8.1. These include Compare and Swap (`CAS`), Swap (`SWP`), Load-Add (`LDADD`), Store-ADD (`STADD`), and more. We leave it as an exercise for the reader to explore these.
+Most of the examples shown in this section use the instructions `LDAR` and `STLR`, which are load-acquire and store-release respectively. However, there are other instructions that support acquire-release ordering. More specifically, the various atomic instructions that were made mandatory in Armv8.1 of the Arm architecture also support it. These include Compare and Swap (`CAS`), Swap (`SWP`), Load-Add (`LDADD`), Store-Add (`STADD`), and more. These are left as an exercise to explore outside this Learning Path.
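+
+If you want to experiment with those atomic instructions on hardware, you can first confirm that your CPU implements them. On Arm Linux, the kernel reports Armv8.1 atomics support through the `atomics` feature flag in `/proc/cpuinfo`; this is only a quick sanity check and the exact output is machine-dependent:
+
+```bash
+# Check for the Armv8.1 atomic instructions (CAS, SWP, LDADD, and so on)
+# Prints "atomics" once if the CPU reports the feature, and nothing otherwise
+grep -o -m1 -w atomics /proc/cpuinfo
+```
+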
## Litmus7 Switches

-Since `litmus7` executes code on the machine, it needs to build the assembly snippets into executable code. `litmus7` compiles with GCC, however, GCC uses default options. If running on an Arm Neoverse platform, we strongly recommend using the `-ccopts="-mcpu=native"` switch. This is the easiest way to avoid all possible run time failures. For example, if the Compare and Swap example below is executed without this switch, it will fail to run because by default, GCC will not emit the Compare and Swap instruction (`CAS`). This instruction is supported in Armv8.1, but GCC by default builds for Armv8.0. If in the future, the GCC default is updated to build for a newer version of the Arm architecture (e.g. Armv9.0). The `-ccopts` switch will not be needed to run tests that use atomic instructions like Compare and Swap (`CAS`).
+As `litmus7` executes code on the machine, it needs to build the assembly snippets into executable code. `litmus7` compiles with GCC; however, GCC uses its default options. When you run on an Arm Neoverse platform, it is strongly recommended that you use the `-ccopts="-mcpu=native"` switch. This is the easiest way to avoid run-time failures. For example, if the Compare and Swap example below is executed without this switch, it will fail to run because, by default, GCC will not emit the Compare and Swap instruction (`CAS`). This instruction is supported in Armv8.1 of the Arm architecture, but GCC by default builds for Armv8.0. If, in the future, the GCC default is updated to build for a newer version of the Arm architecture (for example, Armv9.0), the `-ccopts` switch will not be needed to run tests that use atomic instructions like Compare and Swap (`CAS`).

## Example 1: Message passing without barriers

The first example highlights one of the pitfalls that may occur under a relaxed memory model.

+Create a file named `test1.litmus` with the content shown below:
```
AArch64 MP+Loop
{
@@ -34,29 +35,49 @@ AArch64 MP+Loop
exists
(1:X0=1 /\ 1:X2=0)
```

-This example is similar to what we saw in the previous section, except this time we've added a loop to check for the message ready flag. If we think about this assembly in a Sequentially Consistent way, we can convince ourselves that the only valid outcome of these two threads executing will be `(1,1)`. Below is the output of this test with `herd7`.
+This example is similar to what you saw in the previous section, except this time there is a loop to check for the message ready flag. If you think about this assembly in a Sequentially Consistent way, you can convince yourself that the only valid outcome of these two threads executing will be `(1,1)`.

+Run this with `herd7`:
+
+```bash
+herd7 ./test1.litmus
```
+
+The output will look like:
+
+```output
Test MP+Loop Allowed
States 2
1:X0=1; 1:X2=0;
1:X0=1; 1:X2=1;
```

-We see the outcome of `(1,0)`. Let's try `litmus7` on a Neoverse platform to see if it shows us the same outcomes.
+Notice that `herd7` lists `(1,0)` as an allowed outcome.
+
+Now run the same test with `litmus7` on your Arm Neoverse instance to check whether you get the same outcomes.
+
+Run the following command:
+```bash
+litmus7 ./test1.litmus
```
+
+The output should look like:
+
+```output
Test MP+Loop Allowed
Histogram (2 states)
458   *>1:X0=1; 1:X2=0;
999542:>1:X0=1; 1:X2=1;
```

-`litmus7` shows that most of the time we see the outcome `(1,1)`, but occasionally we see `(1,0)`. This result also highlights why we might want to increase the number of test iterations with `litmus7`. Notice that the `(1,0)` result is rare relative to the result of `(1,1)`. Increasing the number of test iterations will increase the probability that we see all outcomes, including those that are rare.
+`litmus7` shows that most of the time you see the outcome `(1,1)`, but occasionally you see `(1,0)`. This result also highlights why you might want to increase the number of test iterations with `litmus7`. Increasing the number of test iterations will increase the probability that you see all outcomes, including those that are rare.
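+
+For example, you can raise the iteration count with the `-s` switch and run the test on more CPUs in parallel with the `-a` switch, both of which are described in the previous section. The values shown here are only illustrative, so adjust them to your machine:
+
+```bash
+# More iterations (-s) and more parallel CPUs (-a) raise the odds of observing rare outcomes
+litmus7 -s 5000000 -a 4 ./test1.litmus
+```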

-We see the outcome of `(1,0)` because `LDR` and `STR` are ordinary memory accesses. This means that if there are no dependencies between them (there aren't in this case), they can be reordered by the CPU. Both `herd7` and `litmus7` confirm that this can and will happen. The result of `(1,0)` is not desired because it represents reading the payload of a message before the ready flag is set. This is likely not what the programmer intended.
+You will see the outcome of `(1,0)` because `LDR` and `STR` are ordinary memory accesses. This means that if there are no dependencies between them (there aren't in this case), they can be reordered by the CPU. Both `herd7` and `litmus7` confirm that this can and will happen. The result of `(1,0)` is not desired because it represents reading the payload of a message before the ready flag is set. This is likely not what the programmer intended.

## Example 2: Message passing with two-way barriers

-Let's fix the message passing by adding two-way barriers. Namely, the instruction `DMB` which is a data memory barrier.
+Let's fix the message passing by adding two-way barriers. The instruction you will add is `DMB`, a data memory barrier.
+
+Create a litmus file called `test2.litmus` with the contents shown below:
+
```
AArch64 MP+Loop+DMB
{
@@ -72,34 +93,47 @@ AArch64 MP+Loop+DMB
exists
(1:X0=1 /\ 1:X2=0)
```

-In this example we add the `DMB` instruction between the `STR` instructions in `P0`, and between the `LDR` instructions in `P1`. The `DMB` prevents the reordering of memory accesses across it. Note that non-memory access instructions can still be reordered across the `DMB`. For example, it's possible for the second `MOV` in `P0` to be executed before the `DMB` because it's not a memory access instruction. Also, on Arm, `DMB` instructions are Sequentially Consistent with respect to other `DMB` instructions.
-
-Below is the `herd7` output of this test.
+In this example you added the `DMB` instruction between the `STR` instructions in `P0`, and between the `LDR` instructions in `P1`. The `DMB` prevents the reordering of memory accesses across it. Note that non-memory access instructions can still be reordered across the `DMB`. For example, it's possible for the second `MOV` in `P0` to be executed before the `DMB` because it's not a memory access instruction. Also, on Arm, `DMB` instructions are Sequentially Consistent with respect to other `DMB` instructions.
+
+Run this test with `herd7`:
+```bash
+herd7 test2.litmus
```
+
+The output will look like:
+
+```output
Test MP+Loop+DMB Allowed
States 1
1:X0=1; 1:X2=1;
...
Warning: File "test.litmus": unrolling limit exceeded at L1, legal outcomes may be missing.
```

-Now that we have the memory barriers in place, we only see the outcome `(1,1)` which is what is desired for message passing between threads. When we run tests that contain loops with `herd7`, a warning will be shown. This warning appears because `herd7` is a simulator that interleaves instructions to figure out all possible outcomes. A consequence of it working this way means that it will unroll loops first, then test against the unrolled loops. By default, it unrolls loops two times. It is possible to increase this with with `-unroll` switch. That said, it doesn't seem useful to increase the number of unrolls unless there is some very specific and unusual sequencing occurring between the threads being tested. Overall, we strongly recommend that complex scenarios be broken into smaller and more primitive tests.
+Now that you have the memory barriers in place, you only see the outcome `(1,1)`, which is what is desired for message passing between threads. When you run tests that contain loops with `herd7`, a warning will be shown. This warning appears because `herd7` is a simulator that interleaves instructions to figure out all possible outcomes. A consequence of working this way is that it unrolls loops first, then tests against the unrolled loops. By default, it unrolls loops two times. It is possible to increase this with the `-unroll` switch. That said, it doesn't seem useful to increase the number of unrolls unless there is some very specific and unusual sequencing occurring between the threads being tested. Overall, it is strongly recommended that complex scenarios be broken into smaller and more primitive tests.
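+
+If you do want to raise the unrolling limit, pass the `-unroll` switch mentioned above to `herd7`. The value used here is only an illustration:
+
+```bash
+# Unroll loops 4 times (instead of the default 2) before herd7 enumerates interleavings
+herd7 -unroll 4 test2.litmus
+```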

-Below is the `litmus7` output of this test on a Neoverse platform.
+Now let's run the same litmus file with `litmus7`:
+```bash
+litmus7 test2.litmus
```
+
+The output on your Arm Neoverse-based machine will look like:
+
+```output
Test MP+Loop+DMB Allowed
Histogram (1 states)
1000000:>1:X0=1; 1:X2=1;
```
100% of the test runs observed `(1,1)`. This builds confidence that our barriers are working.

-A last point is that the `DMB` in `P1` can be relaxed by changing it to `DMB ISHLD`. This relaxation could potentially yield performance improvements in real applications. However, doing the same relaxation to the `DMB` in `P0` will break the message passing. We encourage readers to try this experiment and also read the Arm documentation on the differences between `DMB ISH`, `DMB ISHLD`, and `DMB ISHST`.
+A last point is that the `DMB` in `P1` can be relaxed by changing it to `DMB ISHLD`. This relaxation could potentially yield performance improvements in real applications. However, doing the same relaxation to the `DMB` in `P0` will break the message passing. You can try this experiment and also read the Arm documentation on the differences between `DMB ISH`, `DMB ISHLD`, and `DMB ISHST`.

## Example 3: Message passing with One-Way Barriers

Let's test message passing with one-way barriers next. This is done by using instructions that support acquire-release ordering.

+Create a litmus file called `test3.litmus` with the contents shown below:
+
```
AArch64 MP+Loop+ACQ_REL
{
@@ -115,30 +149,42 @@ exists
(1:X0=1 /\ 1:X2=0)
```

-In this example, we drop the `DMB` instructions and instead use a properly placed `STLR` in `P0` and `LDAR` in `P1`. `STLR` is a store-release instruction and `LDAR` is a load-acquire instruction. These are synchronizing memory accesses (as opposed to ordinary memory accesses). The `STLR` prevents earlier memory accesses from reordering after it, while the `LDAR` prevents later memory accesses from reordering before it. This is why these are also called one-way barriers; they block reordering only in one direction.
`LDAR` and `STLR` instructions are Sequentially Consistent with respect to other `LDAR` and `STLR` instructions, while ordinary `LDR` and `STR` are not. There is also a more relaxed version of `LDAR` called `LDAPR`, this is a Load-Acquire with Processor Consistency. The Arm documentation on [Load-Acquire and Store-Release instructions](https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions) has more information on this.
+In this example, you have removed the `DMB` instructions and instead used a properly placed `STLR` in `P0` and `LDAR` in `P1`. `STLR` is a store-release instruction and `LDAR` is a load-acquire instruction. These are synchronizing memory accesses (as opposed to ordinary memory accesses). The `STLR` prevents earlier memory accesses from reordering after it, while the `LDAR` prevents later memory accesses from reordering before it. This is why these are also called one-way barriers; they block reordering only in one direction. `LDAR` and `STLR` instructions are Sequentially Consistent with respect to other `LDAR` and `STLR` instructions, while ordinary `LDR` and `STR` are not. There is also a more relaxed version of `LDAR` called `LDAPR`, which is a Load-Acquire with Processor Consistency. The Arm documentation on [Load-Acquire and Store-Release instructions](https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions) has more information on this.

-Below is the `herd7` output of this test.
+Run this test with `herd7`:
+```bash
+herd7 test3.litmus
```
+The output should look like:
+
+```output
Test MP+Loop+ACQ_REL Allowed
States 1
1:X0=1; 1:X2=1;
```
+Only an outcome of `(1,1)` is possible. This is the result you want.

-We see the result we want. Only an outcome of `(1,1)` is possible.
+Run the same test with `litmus7` on your Arm Neoverse-based machine:
+
+```bash
+litmus7 test3.litmus
+```

-Below is the `litmus7` output of this test on a Neoverse platform.
-```\
+The output from this will look like:
+```output
Test MP+Loop+ACQ_REL Allowed
Histogram (1 states)
1000000:>1:X0=1; 1:X2=1;
```

-`litmus7` shows us the same as `herd7`.
+`litmus7` shows the same result as `herd7`.

## Example 4: Compare and Swap with One-Way Barriers

-Atomic instructions support acquire-release semantics. In this example we look at Compare and Swap with acquire ordering (`CASA`).
+Atomic instructions support acquire-release semantics. In this example you will look at Compare and Swap with acquire ordering (`CASA`).
+
+Create a litmus file called `test4.litmus` with the content shown below:

```
AArch64 Lock+Loop+CAS+ACQ_REL
{
@@ -159,31 +205,47 @@ exists
(1:X0=1 /\ 1:X2=0)
```

-This test is a representation of a basic spin lock. The lock variable is in address `y`. When it is set to 1, it's locked, when it's set to 0, it's available. This test starts with the lock variable at address `y` set to 1, which means it's locked. `P0` is assumed to be the owner of the lock at the start of the test. `P0` will write to address `x` (the payload), then release the lock by writing a 0 to address `y`. The store to address `y` is a `STLR` (store-release), this ensures that the write to the payload is visible before the release of the lock at address `y`. On `P1`, we spin on address `y` (the lock) with a `CASA`. At each loop iteration, `CASA` checks the value at address `y`. If it's 0 (available), then it will write a 1 to take ownership. If it's 1, the `CASA` fails and loops back to try the `CASA` again.
It will continue to loop until it successfully takes the lock. The `CASA` instruction does this operation atomically, and with acquire ordering to ensure that the later `LDR` of address `x` (the payload) is not ordered before the `CASA`.
+This test is a representation of a basic spin lock. The lock variable is in address `y`. When it is set to 1, it's locked; when it's set to 0, it's available. This test starts with the lock variable at address `y` set to 1, which means it's locked. `P0` is assumed to be the owner of the lock at the start of the test. `P0` will write to address `x` (the payload), then release the lock by writing a 0 to address `y`. The store to address `y` is a `STLR` (store-release), which ensures that the write to the payload is visible before the release of the lock at address `y`. On `P1`, you spin on address `y` (the lock) with a `CASA`. At each loop iteration, `CASA` checks the value at address `y`. If it's 0 (available), then it will write a 1 to take ownership. If it's 1, the `CASA` fails and loops back to try the `CASA` again. It will continue to loop until it successfully takes the lock. The `CASA` instruction does this operation atomically, and with acquire ordering to ensure that the later `LDR` of address `x` (the payload) is not ordered before the `CASA`.
+
+Run this test with `herd7`:
+
+```bash
+herd7 test4.litmus
+```
+
+The output will look like:

-Below is the `herd7` output of this test.
```
Test Lock+Loop+CAS+ACQ_REL Allowed
States 1
1:X0=1; 1:X2=1;
```

-We see the result we want. Only an outcome of `(1,1)` is possible.
+Only an outcome of `(1,1)` is possible. This is the result you want.

-Below is the `litmus7` output of this test on a Neoverse platform. Note that when we ran this, we used the switch `-ccopts="-mcpu=native"`. If we didn't, `litmus7` would fail with a message saying that the `CASA` instruction cannot be emitted by the compiler.
+Now run the same test with `litmus7` on your Arm Neoverse-based machine:
+
+```bash
+litmus7 ./test4.litmus -ccopts="-mcpu=native"
```
+The output should look like:
+```output
Test Lock+Loop+CAS+ACQ_REL Allowed
Histogram (1 states)
1000000:>1:X0=1; 1:X2=1;
```
Only an outcome of `(1,1)` has been observed. More test iterations can be executed to build confidence that this is the only possible result.

+Note that when you ran `litmus7`, you used the switch `-ccopts="-mcpu=native"`. If you didn't, `litmus7` would fail with a message saying that the `CASA` instruction cannot be emitted by the compiler.
+
Try changing the `CASA` to a `CAS` (Compare and Swap with no ordering) to see what happens.

## Example 5: Dependency as a Barrier

It is possible to create barriers that are more relaxed than acquire-releases. This is done by creating dependencies between instructions that block the CPU's scheduler from reordering them. This is what the C/C++ ordering of `memory_order::consume` aims to achieve. However, over the years, it has been challenging for compilers to support this dependency based barrier concept. Thus,`memory_order::consume` gets upgraded to `memory_order::acquire` in practice. In fact, this ordering is in discussion to be dropped from the C/C++ standard. Still, it makes for an interesting example to show here.

-To be able to show the difference, we first need to look at the one-way barrier example from before but with a modification. That modification is that we add a second payload memory address. We'll call it `w`. The new test with this additional memory location is shown below.
+To be able to show the difference, you first need to look at the one-way barrier example from earlier but with a modification. That modification is to add a second payload memory address, called `w`.
+
+Copy the content shown below with these modifications into a new litmus test file called `test5.litmus`:

```
AArch64 MP+Loop+ACQ_REL2
{
@@ -201,25 +263,38 @@ AArch64 MP+Loop+ACQ_REL2
exists
(1:X0=1 /\ (1:X2=0 \/ 1:X5=0))
```

-On `P0`, we are writing to both `w` and `x` before we do the store-release on address `y`. This ensures that the writes to both `x` and `w` (the payloads) will be ordered before the write to address `y` (the flag). On `P1`, we loop with a load-acquire on address `y` (the flag). Once it is observed to be set, we load the two payload addresses. The load-acquire ensures that we do not read the payload addresses `w` and `x` until the flag is observe to be set. The condition at the bottom has been updated to check for any cases where either `w` or `x` are 0. Either of these being observed as 0 will be an indication of reading the payload before the ready flag is observed to be set (not what we want). Overall, this code should result in only the outcome `(1,1,1)`.
+On `P0`, you are writing to both `w` and `x` before the store-release on address `y`. This ensures that the writes to both `x` and `w` (the payloads) will be ordered before the write to address `y` (the flag). On `P1`, you loop with a load-acquire on address `y` (the flag). Once it is observed to be set, you load the two payload addresses. The load-acquire ensures that you do not read the payload addresses `w` and `x` until the flag is observed to be set. The condition at the bottom has been updated to check for any cases where either `w` or `x` is 0. Either of these being observed as 0 will be an indication of reading the payload before the ready flag is observed to be set (not what you want). Overall, this code should result in only the outcome `(1,1,1)`.
+
+Run this test with `herd7`:

-Below is the `herd7` output of this test.
+```bash
+herd7 test5.litmus
```
+
+The output should look like:
+
+```output
Test MP+Loop+ACQ_REL2 Allowed
States 1
1:X0=1; 1:X2=1; 1:X5=1;
```
+Now run it with `litmus7` on your Arm Neoverse-based machine:

-Below is the `litmus7` output of this test on a Neoverse platform.
+```bash
+litmus7 test5.litmus
```
+The output will look like:
+
+```output
Test MP+Loop+ACQ_REL2 Allowed
Histogram (1 states)
1000000:>1:X0=1; 1:X2=1; 1:X5=1;
```

-Both `herd7` and `litmus7` show us the expected result. It might be worth increasing the number of iterations on `litmus7` to build more confidence in the result.
+Both `herd7` and `litmus7` show the expected result. It might be worth increasing the number of iterations on `litmus7` to build more confidence in the result.
+
+Now let's remove the load-acquire in `P1` and use a dependency as a barrier. Create a new litmus file called `test6.litmus` with the contents shown below:

-Now let's remove the load-acquire in `P1` and use a dependency as a barrier.
```
AArch64 MP+Loop+Dep
{
@@ -236,26 +311,38 @@ AArch64 MP+Loop+Dep
exists
(1:X0=1 /\ (1:X2=0 \/ 1:X5=0))
```

-What we've done to `P1` is create a dependency between the first `LDR` which is in the loop, and the second `LDR` which appears after the loop. Register `X6` (which is the same as `W6`) doesn't change the address loaded in the second `LDR` because the previous `AND` instruction zeros the offset.
The point of the `AND` is to create a dependency between the first and second `LDR` instructions so that they execute in order. The net effect is that we have a barrier for the address `x` between the two `LDR` instructions. However, we do not have a barrier for the address `y`. In this way, it is a more relaxed barrier than using `LDAR`, because `LDAR` applies to all memory accesses that appear after it in program order.
+What you have done to `P1` is create a dependency between the first `LDR`, which is in the loop, and the second `LDR`, which appears after the loop. Register `X6` (which is the same as `W6`) doesn't change the address loaded in the second `LDR` because the previous `AND` instruction zeros the offset. The point of the `AND` is to create a dependency between the first and second `LDR` instructions so that they execute in order. The net effect is that you have a barrier for the address `x` between the two `LDR` instructions. However, you do not have a barrier for the address `w`. In this way, it is a more relaxed barrier than using `LDAR`, because `LDAR` applies to all memory accesses that appear after it in program order.

-Below is the `herd7` output of this test.
+Run this test with `herd7`:
+
+```bash
+herd7 test6.litmus
```
+
+The output of this test should look like:
+```output
Test MP+Loop+Dep Allowed
States 2
1:X0=1; 1:X2=1; 1:X5=0;
1:X0=1; 1:X2=1; 1:X5=1;
```

-The possible outcome of `(1,1,0)` makes sense because the dependency we've added doesn't cover the read of address `w`.
+The possible outcome of `(1,1,0)` makes sense because the dependency you have added doesn't cover the read of address `w`.

-Below is the `litmus7` output of this test on a Neoverse platform.
+Now run the same test with `litmus7` on your Arm Neoverse-based machine:
+
+```bash
+litmus7 test6.litmus
```
+The output should look like:
+
+```output
Test MP+Loop+Dep Allowed
Histogram (2 states)
-1       *>1:X0=1; 1:X2=1; 1:X5=0;
-49999999:>1:X0=1; 1:X2=1; 1:X5=1;
+14      *>1:X0=1; 1:X2=1; 1:X5=0;
+999986:>1:X0=1; 1:X2=1; 1:X5=1;
```

-We have the same result here too. However, notice that we had to execute the test 50 million times just to observe the reorder of the `LDR` of address `w` just one time. Again, if we don't do enough test iterations, we might miss the observation of a possible outcome.
+You have the same result here too. However, notice that you had to execute the test one million times to observe the reordering of the `LDR` of address `w` only fourteen times. Again, if you don't run enough test iterations, you might miss the observation of a possible outcome.

## Experimentation Ideas

-We encourage readers to modify and experiment with the example tests shared here. Readers can also create their own examples by using the assembly code generated from higher level languages. Last, there are also other litmus tests posted in the [herd7 online simulator](https://developer.arm.com/herd7)
+You should now be able to modify and experiment with the example tests shared in this Learning Path. You can also create your own examples by using the assembly code generated from higher-level languages.
Another great resource for experimentation is the set of litmus tests posted in the [herd7 online simulator](https://developer.arm.com/herd7).
diff --git a/content/learning-paths/servers-and-cloud-computing/memory_consistency/explore_deeper.md b/content/learning-paths/servers-and-cloud-computing/memory_consistency/explore_deeper.md
index 965cb707ed..e9d7a7bc94 100644
--- a/content/learning-paths/servers-and-cloud-computing/memory_consistency/explore_deeper.md
+++ b/content/learning-paths/servers-and-cloud-computing/memory_consistency/explore_deeper.md
@@ -1,6 +1,6 @@
---
# User change
-title: "Exploring More Deeply"
+title: "Additional Resources"

weight: 5

@@ -8,18 +8,18 @@ weight: 5
layout: "learningpathall"
---

-## Exploring the Memory Model More Deeply
+## Exploring the Memory Model In Depth

-In this Learning Path we took a practical approach to exploring the Arm memory model and thread synchronization. If the reader would like to explore this topic more deeply and more formally, we recommend reading through the following blog series.
+In this Learning Path you took a practical approach to exploring the Arm memory model and thread synchronization. If you would like to explore this topic further and more formally, read through the following blog series:

- [A working example of how to use the herd7 Memory Model Tool](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/how-to-use-the-memory-model-tool)
- [How to generate litmus tests automatically with the diy7 tool](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/generate-litmus-tests-automatically-diy7-tool)
- [Running litmus tests on hardware using litmus7](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/running-litmus-tests-on-hardware-litmus7)
- [Expanding the Memory Model Tools to System-level architecture](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/expanding-memory-model-tools-system-level-architecture)

-Also, the [diy7 documentation](https://diy.inria.fr/doc/index.html) links to a few papers that explore memory models more formally.
+Also, the [diy7 documentation](https://diy.inria.fr/doc/index.html) provides links to a few papers that explore memory models more formally.

-## Docs on Barriers
+## Documentation on Barriers

- [Barriers on Armv8](https://developer.arm.com/documentation/100941/0101/Barriers).
- [Load-Acquire and Store-Release on Armv8](https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions).
@@ -27,4 +27,4 @@ Also, the [diy7 documentation](https://diy.inria.fr/doc/index.html) links to a f

## More Tests to Explore

-There are many more tests that can be explored in the [herd7 simulator tool](https://developer.arm.com/herd7).
\ No newline at end of file
+There are many more tests that can be explored in the [herd7 simulator tool](https://developer.arm.com/herd7).
diff --git a/content/learning-paths/servers-and-cloud-computing/memory_consistency/litmus_syntax.md b/content/learning-paths/servers-and-cloud-computing/memory_consistency/litmus_syntax.md
index 028eb661b6..f8417b06cc 100755
--- a/content/learning-paths/servers-and-cloud-computing/memory_consistency/litmus_syntax.md
+++ b/content/learning-paths/servers-and-cloud-computing/memory_consistency/litmus_syntax.md
@@ -10,9 +10,12 @@ layout: "learningpathall"

## Herd7 & Litmus7 Test Primer

-In this section we will give just enough information to understand the syntax of litmus tests and how to run them. More comprehensive information on developing and runing litmus tests can be found in the [diy7 documentation](https://diy.inria.fr/doc/index.html)
+In this section you will get an overview of the syntax of litmus tests and learn how to run them. More comprehensive information on developing and running litmus tests can be found in the [diy7 documentation](https://diy.inria.fr/doc/index.html).
+
+For this Learning Path, you will use an abbreviated version of the `MP.litmus` test included in the [herd7 online simulator](https://developer.arm.com/herd7).
+
+Open an editor of your choice and copy the contents below into a file called `test.litmus`:

-We will use an abbreviated version of the `MP.litmus` test included in the [herd7 online simulator](https://developer.arm.com/herd7).
```
AArch64 MP
{
@@ -27,10 +30,11 @@ AArch64 MP
exists
(1:X0=1 /\ 1:X2=0)
```
+Let's inspect this litmus file to gain a better understanding of the assembly code:

-- The first line sets the target architecture of this test which is `AArch64`. This is required because the diy7 tool supports other architectures that are not Arm. After the architecture label we have `MP` which is a label that gives this test its name. `MP` stands for "Message Passing".
-  - In short, the first line say's "This is a test for AAarch64 called `MP`"
-- This test defines two threads `P0` and `P1`. It is possible to define more threads than just two, but in this Learning Path we will only work with two threads to keep things manageable.
+- The first line sets the target architecture of this test, which is `AArch64`. This is required because the diy7 tool supports architectures other than Arm. After the architecture label is `MP`, which is a label that gives this test its name. `MP` stands for "Message Passing".
+  - In short, the first line says "This is a test for AArch64 called `MP`".
+- This test defines two threads `P0` and `P1`. It is possible to define more threads than just two, but in this Learning Path you will only work with two threads to avoid increasing complexity.
- The lines between the brackets ( `{..}` ) determines the initial state of the CPU registers for this test.
  - `0:X1=x; 0:X3=y;` defines the initial state for `P0`.
    - Register X1 holds memory address `x`.
@@ -40,27 +44,28 @@ exists
    - Register X3 holds memory address `x`.
  - Any registers and memory addresses not listed in between these brackets are initialized to 0.
- The memory address `x` contains the message to pass, while the memory address `y` is the message ready flag.
-- Below the brackets we see the assembly snippets for `P0` and `P1`.
+- Below the brackets you see the assembly snippets for `P0` and `P1`.
  - `P0`.
    - The first `MOV` and `STR` writes a `1` into memory address `x` (the payload).
    - The second `MOV` and `STR` writes a `1` into memory address `y` (the flag).
  - `P1`.
    - The first `LDR` reads the address `y` (the flag).
The value of the flag is stored in register `W0`. - The second `LDR` reads the address `x` (the payload). The value of the payload is stored in register `W2`. - - Notice that we are not looping on the flag like what would normally be done in a message passing scenario. We explore loops in the next section. + - Notice that you are not looping on the flag like what would normally be done in a message passing scenario. You will explore loops in the next section. - The last two lines shows a test condition. This test condition asks the question: - - "On `P1`, is it possible for us to observe register `W0` (the flag) set to 1 **AND** register `W2` (the payload) set to 0?" + - "On `P1`, is it possible to observe register `W0` (the flag) set to 1 **AND** register `W2` (the payload) set to 0?" - Wait...but the condition uses register names `X0` and `X2`, not `W0` and `W2`. See the note below for more. - In this condition check syntax, `/\` is a logical **AND**, while `\/` is a logical **OR** - A note on `X` and `W` registers. - - Notice we are using `X` registers for storing addresses and for doing the condition check, but `W` registers for everything else. + - Notice you are using `X` registers for storing addresses and for doing the condition check, but `W` registers for everything else. - Addresses need to be stored as 64-bit values, hence the need to use `X` registers for the addresses because they are 64-bit. `W` registers are 32-bit. In fact, register `Wn` is the lower 32-bits of register `Xn`. - - Writing the litmus tests this way is cleaner than if we use all `X` registers. If all `X` registers are used, the data type of each register needs to be declared on additional lines. For this reason, most tests are written as shown above. The way this is done may be changed in the future to reduce potential confusion around the mixed use of `W` and `X` registers, but all of this is functionally correct. + - Writing the litmus tests this way is simpler than using all `X` registers. If all `X` registers are used, the data type of each register needs to be declared on additional lines. For this reason, most tests are written as shown above. The way this is done may be changed in the future to reduce potential confusion around the mixed use of `W` and `X` registers, but all of this is functionally correct. -Before we run this test with `herd7` and `litmus7`, let's hypothesize on what might be observed in registers `X0` (flag) and `X2` (payload) on thread `P1`. Even though we know Arm machines are not Sequentially Consistent (modern machines usually aren't), let's start by assuming this to be the case anyway. To reason on the execution of these threads in a Sequentially Consistent way, means to imagine the instructions on `P0` and `P1` are executed in some interleaved order. Further, if we interleave these instructions in all possible permutations, we can figure out all of the possible valid outcomes of registers `X0` (flag) and `X2` (payload) on `P1`. For the example test above, the possible valid outcomes of `(X0,X2)` (or `(flag,data)`) are `(0,0)`, `(0,1)`, & `(1,1)`. Below we show some permutations that result in these valid outcomes. These are not all the possible instruction permutations for this test. Listing them all would make this section needlessly long. +Before you run this test with `herd7` and `litmus7`, let's hypothesize on what might be observed in registers `X0` (flag) and `X2` (payload) on thread `P1`. 
Even though Arm machines are not Sequentially Consistent (modern machines usually aren't), let's start by assuming this to be the case anyway. To reason about the execution of these threads in a Sequentially Consistent way means to imagine that the instructions on `P0` and `P1` are executed in some interleaved order. Further, if you interleave these instructions in all possible permutations, you can figure out all of the possible valid outcomes of registers `X0` (flag) and `X2` (payload) on `P1`. For the example test above, the possible valid outcomes of `(X0,X2)` (or `(flag,data)`) are `(0,0)`, `(0,1)`, & `(1,1)`. Shown below are some permutations that result in these valid outcomes. These are not all the possible instruction permutations for this test. Listing them all would make this section needlessly long.

A permutation that results in `(0,0)`:
-```
+
+```output
(P1) LDR W0, [X1] # P1 reads flag, gets 0
(P1) LDR W2, [X3] # P1 reads payload, gets 0
(P0) MOV W0, #1
@@ -71,7 +76,8 @@ A permutation that results in `(0,0)`:
In this permutation of the test execution, `P1` runs to completion before `P0` even starts its execution. For this reason, `P1` observes the initial values of 0 for both the flag and payload.

A permutation that results in `(0,1)`:
-```
+
+```output
(P1) LDR W0, [X1] # P1 reads flag, gets 0
(P0) MOV W0, #1
(P0) STR W0, [X1] # P2 writes payload, sets 1
@@ -82,7 +88,8 @@ A permutation that results in `(0,1)`:
In this permutation of the test execution, `P1` reads the initial value of the flag (the first line) because this instruction is executed before `P0` writes the flag (the last list). However `P1` reads the payload value of 1 because it executes after `P0` writes the payload to 1 (third and forth lines).

A permutation that results in `(1,1)`:
-```
+
+```output
(P0) MOV W0, #1
(P0) STR W0, [X1] # P2 writes payload, sets 1
(P0) MOV W2, #1
@@ -94,9 +101,17 @@ In this permutation of the test execution, `P2` runs to completion before `P1` s

Result `(1,0)` is not possible in a Sequentially Consistent execution of this test. This is because the instructions in `P0` and `P1` must execute in program order (a requirement for Sequential Consistency). It is only possible to get the result `(1,0)` if the machine is not Sequentially Consistent. Not being Sequentially Consistent means that the memory access instructions (`STR` and `LDR`) can be executed out of program order if there are no dependencies between them.

-Now that we've developed a hypothesis on what we should see assuming a Sequentially Consistent machine, let's try running the test with `herd7`. `herd7` will simulate all the possible instruction permutations to see which outcomes are possible. Below is the output of the test on `herd7`.
+Now that you have a hypothesis for what you should see on a Sequentially Consistent machine, let's try running the test with `herd7`. `herd7` will simulate all the possible instruction permutations to see which outcomes are possible.

+Run the following command:
+
+```bash
+herd7 ./test.litmus
```
+
+The output of the test on `herd7` will look like:
+
+```output
Test MP Allowed
States 4
1:X0=0; 1:X2=0;
@@ -107,13 +122,13 @@ Ok
Witnesses
Positive: 1 Negative: 3
```
+It turns out that you do see outcome `(1,0)` when testing against the Arm memory model. This point is also signaled by the positive witness on the test condition. This shows that the Arm memory model violates Sequential Consistency. As mentioned earlier, you know this already.
The Arm memory model tends to be a Relaxed Consistency model; it is described this way because it is possible for an Arm CPU in the real world to implement ordering rules that are stronger than Relaxed Consistency. Going stronger on the memory model will not violate the Arm ordering rules. A stronger memory model means that there might be less opportunity for hardware-based optimizations, but it will not affect the correctness of code execution (assuming no bugs in the code). That said, all Arm Neoverse CPUs can be thought of as following a Relaxed Consistency model (i.e. acquire-release ordering is supported).

-It turns out that we do see outcome `(1,0)` when testing against the Arm memory model. This point is also signaled by the positive witness on our test condition. This tells us that the Arm memory model violates Sequential Consistency. As mentioned earlier, we know this already. The Arm memory model tends to be a Relaxed Consistency model. We are saying Arm "tends to" be a Relaxed Consistency model because it is possible for an Arm CPU in the real world to use ordering rules that are stronger than Relaxed Consistency. Going stronger on the memory model will not violate the Arm ordering rules. A stronger memory model means that there might be less opportunity for HW based optimizations, it will not affect the correctness of code execution (assuming no bugs in the code). That said, all Arm Neoverse CPUs in the market can be thought of as following a Relaxed Consistency model (i.e. acquire-release ordering is supported).
-
-In a Release Consistency model, ordinary memory accesses like `STR` and `LDR` do not need to follow program order. This relaxation in the ordering rules expands the list of instruction permutations in the litmus test above. It is in these additional instruction permutations allowed by the Relaxed Consistency model that yields at least one permutation that results in `(1,0)`. Below is one such example of a permutation. For this permutation, we reordered the `LDR` instructions in `P1`.
+In a Relaxed Consistency model, ordinary memory accesses like `STR` and `LDR` do not need to follow program order. This relaxation in the ordering rules expands the list of instruction permutations in the litmus test above. It is these additional instruction permutations allowed by the Relaxed Consistency model that yield at least one permutation that results in `(1,0)`. Below is one such example of a permutation. For this permutation, the `LDR` instructions in `P1` are reordered.

One possible permutation resulting in `(1,0)`:
-```
+
+```output
(P1) LDR W2, [X3] # P1 reads payload, gets 0
(P0) MOV W0, #1
(P0) STR W0, [X1] # P2 writes payload, sets 1
@@ -121,13 +136,21 @@ One possible permutation resulting in `(1,0)`:
(P0) STR W2, [X3] # P2 writes flag, sets 1
(P1) LDR W0, [X1] # P1 reads flag, gets 1
```
-There are more possible permutations of instructions, including some that reorder the `STR` instructions in `P0`. We will not explore those here, this is what `herd7` tests for us anyway.
+There are more possible permutations of instructions, including some that reorder the `STR` instructions in `P0`. You will not explore those here; this is what `herd7` tests for anyway.

-Now that we see the results against the formal Arm memory model, we can try testing on actual hardware with `litmus7`. Since `litmus7` runs the test against actual HW, it can't guarantee that all instruction permutations are tested.
This is because there isn't a way to manipulate the CPUs dynamic instruction scheduler (this is "hardwired" in the CPU design). Instead, we have to run numerous iterations of the test to increase the probability that we see all possible outcomes. Thus, `litmus7` can't formally verify/confirm the memory model of the CPU, it can only provide empirical evidence to support a claim on the CPU's memory model. That said, it can be used to disprove that the CPU does not follow a specific memory model. For example, we can prove that Arm is **not** Sequentially Consistent by finding at least one outcome that would not be allowed by Sequential Consistency.
+Now that you see the results against the formal Arm memory model, you can try testing on actual hardware with `litmus7`. Since `litmus7` runs the test against actual hardware, it can't guarantee that all instruction permutations are tested. This is because there isn't a way to manipulate the CPU's dynamic instruction scheduler (this is "hardwired" in the CPU design). Instead, you have to run numerous iterations of the test to increase the probability that you see all possible outcomes. Thus, `litmus7` can't formally verify or confirm the memory model of the CPU; it can only provide empirical evidence to support a claim about the CPU's memory model. That said, it can be used to show that the CPU does not follow a specific memory model. For example, you can prove that Arm is **not** Sequentially Consistent by finding at least one outcome that would not be allowed by Sequential Consistency.

-Below we run the test 1,000,000 times with `litmus7`. 1,000,000 is the default number of iterations, this can be adjusted with the `-s` switch. It is also possible to run the test on more than one CPU in parallel with the `-a` switch. The more test iterations we run, the more confidence there can be that we've capture all possible outcomes. Below is the result of running the test on an Arm Neoverse system.
+You will now run the same `test.litmus` file 1,000,000 times with `litmus7`. 1,000,000 is the default number of iterations; this can be adjusted with the `-s` switch. It is also possible to run the test on more than one CPU in parallel with the `-a` switch. The more test iterations you run, the more confident you can be that you have captured all possible outcomes.

+Run the following command on your Arm Neoverse-based instance:
+
+```bash
+litmus7 ./test.litmus
```
+
+The output from running the test should look like:
+
+```output
Test MP Allowed
Histogram (4 states)
553396:>1:X0=0; 1:X2=0;
@@ -137,4 +160,4 @@ Histogram (4 states)
```
`litmus7` reports the same 4 outcomes we saw in the `herd7` simulation. The outcome with the asterisk signals the positive witness of the condition check. It also reports the number of times each outcome was observed. This result suggests that this CPU is in agreement with the formal memory model (tested with `herd7`), at least for this test. To build more confidence, more tests and more test iterations with `litmus7` would be needed. The [herd7 online simulator](https://developer.arm.com/herd7) contains additional tests that can be executed. Additional tests can also be created by the reader.

-Let's move on to exploring thread synchronization.
+Let's move on to exploring thread synchronization in the next section.