# CSE113: Parallel Programming

#### • Topics:

- Finish up GPUs
- Homework 4
- Start on memory models



#### Announcements

#### **Grading**

• Grades for HW 3 should be out by the end of the week

• HW 4 is released.

What is a warp?

- Group of 32 threads in a GPU
- Group of 16 threads in a GPU
- Group of 2 threads in a GPU
- $\bigcirc$  Group of some threads in a GPU



What is a warp?

- Group of 32 threads in a GPU
  - Group of 16 threads in a GPU
  - Group of 2 threads in a GPU
- -> Group of some threads in a GPU

Like the CPU cache, the Load/Store Unit reads in memory in chunks. Is this affirmation true or false?

- True
- False

4 cores are accessing memory. What can happen

#### **Read contiguous values**

Like the CPU cache, the Load/Store Unit reads in memory in chunks. 16 bytes

Can easily distribute the values to the threads

1 request to GPU memory



Like the CPU cache, the Load/Store Unit reads in memory in chunks. Is this affirmation true or false?

- - False

What could we observe with the demonstrations made in class about memory accesses on GPUs?

#### Chunked Pattern

the first element accessed by the 4 threads sharing a load store unit. What sort of access is this?

Computation can easily be divided into threads

Thread 0 - Blue

Thread 1 - Yellow

Thread 2 - Green

Thread 3 - Orange



How can we fix this

### Stride Pattern



### Stride Pattern



4 cores are accessing memory. What can happen

#### **Read contiguous values**

Like the CPU cache, the Load/Store Unit reads in memory in chunks. 16 bytes

Can easily distribute the values to the threads

1 request to GPU memory



What could we observe with the demonstrations made in class about memory accesses on GPUs?

Memory coalescing accelerates execution.

Why do we need to calculate int i = blockIdx.x \* blockDim.x + threadIdx.x?

- $\bigcirc$  To get the index in the matrix we want to compute in that thread
- O To get the whole matrix we want to compute in that thread
- O To get the index in the matrix we want to compute in that warp

Why do we need to calculate int i = blockIdx.x \* blockDim.x + threadIdx.x?

- -> O To get the index in the matrix we want to compute in that thread
  - O To get the whole matrix we want to compute in that thread
  - O To get the index in the matrix we want to compute in that warp

### Homework 4 - first look

- Prerequisits
  - Google Chrome
  - Should be stable on Linux, Windows and Mac.
- No docker container for this assignment.

#### Homework 4 - first look

- Javascript shared array buffer:
  - How javascript threads can actually share memory
  - Similar to memory in C++

```
var buffer = new SharedArrayBuffer(Float32Array.BYTES_PER_ELEMENT * NUM);
var array = new Float32Array(buffer);
```

## Shared Array Buffer

• Like Malloc, allocates a "pointer" to a contagious array of bytes

• Can pass the "pointer" to different threads

Need to instantiate a typed array to access the values

#### Web Workers

How to do multi-threading in javascript

- Async
  - Concurrent (executes on the same thread)
  - Good for I/O and user interactions
- Web Workers will execute on multiple cores
  - Better for compute intensive applications
  - Better performance

#### How to use?

- Create a new worker with a file
  - Doesn't do anything yet
- File contains a function: "on message"
- Main file calls "post message" along with arguments to start the thread
- Worker sends a message back to the main file, it can catch the data

- The language is wgsl (WebGPU Shading Language)
  - It is new, there are not many examples (and the specification changes!)
  - Official specification is here: https://www.w3.org/TR/WGSL/

- wgsl is NOT javascript
- Javascript is interpreted: not possible on GPUs
- wgsl is compiled
  - into Vulkan on Linux
  - into Metal on Apple
  - into HLSL on Windows
- No printing (so GPU code can be difficult to debug)

variables (optional types):

```
var <name> = <value>;
var cluster_dist = 3.0;

var <name> : <type> = <value>;
var cluster_dist : f32 = 3.0;
```

types:

• i32

• u32

• f32

vec2<f32>

array<type>

```
struct Particle {
    pos : vec2<f32>;
struct Particles {
    particles : array<Particle>;
var index_pos : vec2<f32> = particlesA.particles[index].pos;
var index : u32 = GlobalInvocationID.x;
```

structures

• Built-ins (global id) you have one thread for each particle!

- Built in functions:
  - arrayLength
  - sqrt
  - pow
  - distance

#### For loops:

```
for (var i : u32 = 0u; i < arrayLength(&particlesA.particles); i = i + 1u) {
...
}</pre>
```

Types can be frustrating

• But compiler errors will help you, and you can do casts.

### Moving on to memory consistency!

• One of my favorite topics!

### Moving on to memory consistency!

One of my favorite topics!

What do other people think?

Look, memory ordering pretty much \_is\_ the rocket science of CS,

**Linus Torvalds** 

### Memory Consistency

- We have been very strict about using atomic types in this class
  - and the methods (.load and .store)
  - why?
  - Architectures do very strange things with memory loads and stores
  - Compilers do too (but we won't talk too much about them today)
  - C++ gives us sequential consistency if we use atomic types and operations
  - What do we remember sequential consistency from?

## Sequential consistency for atomic memory

• Let's play our favorite game:

#### Global variable:

```
atomic_int x(0);
atomic_int y(0);
```

```
Thread 0:
x.store(1);
y.store(1);
```

```
Thread 1:
int t0 = y.load();
int t1 = x.load();
```

#### Global variable:

```
atomic_int x(0);
atomic_int y(0);
```

```
Thread 0:
x.store(1);
y.store(1);
```

```
Is it possible for
  t0 == 0 \text{ and } t1 == 1
int t0 = y.load();
   x.store(1);
int t1 = x.load();
    y.store(1);
```

```
Thread 1:
int t0 = y.load();
int t1 = x.load();
```

#### Global variable:

```
atomic_int x(0);
atomic_int y(0);
```

```
Thread 0:
x.store(1);
y.store(1);
```

```
Is it possible for
  t0 == 0 \text{ and } t1 == 1
int t0 = y.load();
    x.store(1);
    y.store(1);
int t1 = x.load();
```

```
Thread 1:
int t0 = y.load();
int t1 = x.load();
```

```
atomic_int x(0);
atomic_int y(0);
```

```
Thread 0:
x.store(1);
y.store(1);
```

x.store(1);

#### **How about:**

```
Is it possible for
  t0 == 1 \text{ and } t1 == 0
   y.store(1);
int t0 = y.load();
int t1 = x.load();
```

```
Thread 1:
int t0 = y.load();
int t1 = x.load();
```

```
atomic_int x(0);
atomic_int y(0);
```

```
Thread 0:
x.store(1);
y.store(1);
```

```
x.store(1);
```

no where for this one to go!

#### **How about:**

```
Is it possible for
 t0 == 1 \text{ and } t1 == 0
   y.store(1);
 int t0 = y.load();
int t1 = x.load();
```

```
atomic_int x(0);
atomic_int y(0);
```

```
Another test Can t0 == t1 == 0?
```

```
Thread 0:
x.store(1);
int t0 = y.load();
```

```
<u>Thread 1:</u>
y.store(1);
int t1 = x.load();
```

```
atomic_int x(0);
atomic_int y(0);
```

```
Another test Can t0 == t1 == 0?
```

```
Thread 0:
x.store(1);
int t0 = y.load();
```

x.store(1);

```
int t0 = y.load();

y.store(1);

int t1 = x.load();
```

#### *Thread 1:*

```
y.store(1);
int t1 = x.load();
```

```
atomic_int x(0);
atomic_int y(0);
```

```
Another test
Can t0 == t1 == 0?
```

```
Thread 0:
x.store(1);
int t0 = y.load();
```

x.store(1);

```
int t0 = y.load();
y.store(1);
int t1 = x.load();
```

no place for this one!

#### Thread 1:

```
y.store(1);
int t1 = x.load();
```

## C++

• Plain atomic accesses are documented to be sequentially consistent (SC)

- Why wasn't SC very good for concurrent data structures?
  - Compossibility: two objects that are SC might not be SC when used together
  - Programs contain only 1 shared memory though; no reason to compose different main memories.

# What about ISAs?

 Remember, it is important for us to understand how our code executes on the architecture to write high performing programs

- Lets think about x86
  - Instructions:
  - MOV %t0 [x] loads the value at x to register t0
  - MOV [y] 1 stores the value 1 to memory location y

```
int x[1] = \{0\};
int y[1] = \{0\};
```

# Another test Can t0 == t1 == 0?

#### Thread 0:

```
mov [x], 1
mov %t0, [y]
```

#### Thread 1:

```
int x[1] = \{0\};
int y[1] = \{0\};
```

## Thread 0:

```
mov [x], 1
mov %t0, [y]
```

```
mov [x], 1
```

#### Another test

Can t0 == t1 == 
$$0$$
?

#### Thread 1:

```
int x[1] = \{0\};
int y[1] = \{0\};
```

#### Thread 0:

```
mov [x], 1
mov %t0, [y]
```

#### Another test



#### Thread 1:

```
mov [y], 1
mov %t1, [x]
```

mov [y], 1

no place for this event!

## ISA is not SC

- We'd like to be able to compile atomic instructions just to regular ISA loads and stores
- What if we actually run this code?

# Schedule

- Memory consistency models:
  - Total store order
  - Relaxed memory consistency

mov [x], 1

mov %t0, [y]

Core 0

## Thread 1:

mov [y], 1

mov %t1, [x]

Core 1

x:0

y:0

#### Thread 1:

mov %t0, [y]

mov %t1, [x]

Core 0

mov [x], 1

execute first instruction what happens to the stores?

Core 1 mov [y], 1

x:0

y:0

#### Thread 1:

mov %t0, [y]

X86 cores contain a store buffer; holds stores before going to main memory

mov %t1, [x]

Core 0

Store Buffer x:1

Store Buffer
y:1

Core 1

x:0 y:0

# <u>Thread 0:</u>

y:0

#### Thread 1:

buffer; holds stores before mov %t1, [x] mov %t0, [y] going to main memory Store Buffer Store Buffer Core 0 Core 1 x:1 y:1 eventually they flush to main memory x:0

**Main Memory** 

X86 cores contain a store

## Thread 1:



# <u>Thread 0:</u>

mov [x], 1

mov %t0, [y]

rewind

Core 0

Store Buffer

Thread 1:

mov [y], 1

mov %t1, [x]

Store Buffer

Core 1

## Thread 1:

mov %t1, [x]

mov %t0, [y]

Store Buffer

Core 0

mov [x], 1

execute first instruction



#### Thread 1:

mov %t0, [y]

values get stored in SB

mov %t1, [x]

Core 0

Store Buffer x:1 Store Buffer y:1

Core 1

Thread 1:

#### Execute next instruction





#### Values get loaded from memory



## Thread 1:

#### we see t0 == t1 == 0!



#### Store buffers are drained eventually



Store buffers are drained eventually but we've already done our loads

Store Buffer Store Buffer Core 0 Core 1 x:1 Main Memory y:1

# Our first relaxed memory execution!

Relaxed memory (also known as Weak memory) behaviors

An execution that is NOT allowed by sequential consistency

- Relaxed memory model: a memory model that allows relaxed memory executions
  - X86 has a relaxed memory model due to store buffering
  - If you restrict yourself to use only default atomic operations, C++ does NOT have a weak memory model

## Litmus tests

Small concurrent programs that check for relaxed memory behaviors

 Vendors have a long history of under documented memory consistency models

- Academics have empirically explored the memory models
  - X86 behaviors were documented by researchers before Intel!
  - Many vendors have unofficially endorsed academic models

# Litmus tests

#### This test is called "store buffering"

```
<u>Thread 0:</u>
mov [x], 1
mov %t0, [y]
```

```
<u>Thread 1:</u>
mov [y], 1
mov %t1, [x]
```

```
Can t0 == t1 == 0?
```

# Fences: restoring sequential consistency

• It is typical that relaxed memory models provide special instructions which can be used to disallow weak behaviors.

These instructions are called Fences

• The X86 fence that flushes the store buffer is called mfence.

mov [x], 1

mfence

mov %t0, [y]

Core 0

Store Buffer

#### Thread 1:

mov [y], 1

mfence

mov %t1, [x]

Store Buffer

Core 1

x:0 y:0

#### Thread 1:

mfence

mov %t1, [x]



**Execute first instruction** 



#### Thread 1:

mfence

mov %t0, [y]

Core 0

Store Buffer x:1

Values go into the store buffer

Store Buffer y:1 mfence

mov %t1, [x]

Core 1

#### Thread 1:

#### Execute next instruction





## Thread 1:



Thread 1:

store buffers are flushed

mov %t0, [y]

Core 0

Store Buffer

Store Buffer

Core 1

mov %t1, [x]

x:1

y:1

Thread 1:

#### execute next instruction





Thread 1:

#### values are loaded from memory



Thread 1:

We don't get the problematic behavior: t0 == t1 == 0



mov [x], 1

mov %t0, [x]

Core 0

Store Buffer

single thread same address

possible outcomes:

t0 = 1

t0 = 0

Which one do you expect?

x:0

y:0

Main Memory

mov [x], 1

How does this execute?

mov %t0, [x]

Core 0

Store Buffer

x:0

y:0

Main Memory

#### execute first instruction





Store the value in the store buffer

mov %t0, [x]

Core 0

Store Buffer x:1

y:0 Main Memory

#### **Next instruction**





Where to load??

Store buffer? Main memory?







Where to load??

Threads check store buffer before going to main memory

It is close and cheap to check.



# Memory Consistency

How to specify a relaxed memory model?

- We can do it operationally
  - by constructing a high-level machine and reasoning about operations through the machine.

or we can talk about instructions that are allowed to "break" program order.

```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Another test Can t0 == t1 == 0?
```

```
<u>Thread 0:</u>
mov [x], 1
mov %t0, [y]
```

```
<u>Thread 1:</u>
mov [y], 1
mov %t1, [x]
```

We will annotate instructions with S for store, and L for loads

```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Another test Can t0 == t1 == 0?
```

#### Thread 0:

```
S:mov [x], 1
L:mov %t0, [y]
```

## Thread 1:

```
S:mov [y], 1
L:mov %t1, [x]
```

We will annotate instructions with S for store, and L for loads

```
int x[1] = \{0\};
int y[1] = \{0\};
```

# Another test Can t0 == t1 == 0?

#### Thread 0:

```
S:mov [x], 1
L:mov %t0, [y]
```

#### Thread 1:

```
S:mov [y], 1
L:mov %t1, [x]
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```

### Thread 0:

```
S:mov [x], 1
L:mov %t0, [y]
```

# Another test Can t0 == t1 == 0?



#### Thread 1:

```
S:mov [y], 1
L:mov %t1, [x]
```

S:mov [y], 1

Now we make a new rule:

S(tores) followed by a L(oad) do not have to follow program order

```
int x[1] = \{0\};
int y[1] = \{0\};
```

## Thread 0:

```
S:mov[x], 1
L:mov %t0, [y]
```





Now we make a new rule:

S(tores) followed by a L(oad) do not have to follow program order

```
int x[1] = \{0\};
int y[1] = \{0\};
```

#### Thread 0:

```
S:mov [x], 1
L:mov %t0, [y]
```

# Another test Can t0 == t1 == 0?



#### Thread 1:

S:mov [y], 1 L:mov %t1, [x]

Now we can satisfy the condition!

int 
$$x[1] = \{0\};$$
  
int  $y[1] = \{0\};$ 

## Another test Can t0 == t1 == 0?

## Thread 0:

Lets peak under the hood here



int 
$$x[1] = \{0\};$$
  
int  $y[1] = \{0\};$ 

#### Thread 0:

S:mov [x], 1 L:mov %t0, [y]

Lets peak under the hood here

Global timeline is when the Store operation becomes visible to other threads





```
int x[1] = \{0\};
int y[1] = \{0\};
```

### Thread 0:

```
S:mov [x], 1
L:mov %t0, [y]
```

Lets peak under the hood here

Global timeline is when the Store operation becomes visible to other threads Another test
Can t0 == t1 == 0?



```
int x[1] = \{0\};
int y[1] = \{0\};
```

#### Thread 0:

```
S:mov [x], 1
L:mov %t0, [y]
```

Lets peak under the hood here

Global timeline is when the Store operation becomes visible to other threads

#### Another test

Can t0 == t1 == 
$$0$$
?

we can ignore this condition!!



#### Thread 1:

S:mov [y], 1

L:mov %t1, [x]

## Questions

• Can stores be reordered with stores?

mov [x], 1

mov [y], 1

Core 0

Store Buffer

x:0

y:0

Main Memory

mov [y], 1

execute the first instruction

Core 0

Store Buffer

mov [x], 1

y:0 Main Memory

mov [y], 1

value goes into store buffer

Core 0

Store Buffer x:1

y:0 Main Memory

mov [y], 1

execute next instruction

Core 0

Store Buffer x:1

x:0

y:0

Main Memory

#### execute next instruction





value goes into the store buffer

Core 0

| Store Buffer |
|--------------|
| x:1          |
| y:1          |
|              |

y:0 Main Memory







## Questions

• Can stores be reordered with stores?

• How do we make rules about mfence?

```
int x[1] = \{0\};
int y[1] = \{0\};
```

#### Thread 0:

```
S:mov [x], 1
mfence
L:mov %t0, [y]
```

```
S:mov [x], 1
```

mfence

```
L:mov %t0, [y]
```

Another test Can t0 == t1 == 0?

#### Thread 1:

```
S:mov [y], 1
mfence
L:mov %t1, [x]
```

```
S:mov [y], 1
```

mfence

```
L:mov %t1, [x]
```

Rules: S(tores) followed by a L(oad) do not have to follow program order.

```
int x[1] = \{0\};
int y[1] = \{0\};
```

### Thread 0:

```
S:mov [x], 1
mfence
L:mov %t0, [y]
```

So we can't reorder this instruction at all!

Another test Can t0 == t1 == 0?

#### Thread 1:

S:mov [y], 1
mfence
L:mov %t1, [x]



S:mov [y], 1

Rules:

S(tores) followed by a L(oad) do not have to follow program order.

## Rules

• Are we done?

Rules:

S(tores) followed by a L(oad) do not have to follow program order.

```
int x[1] = \{0\};
int y[1] = \{0\};
```

## Another test Can t0 == 0?

#### Thread 0:

S:mov [x], 1

L:mov %t0, [x]

S:mov [x], 1

L:mov %t0, [x]

Rules:

S(tores) followed by a L(oad) do not have to follow program order.

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Another test
Can t0 == 0?

#### Thread 0:

S:mov [x], 1

L:mov %t0, [x]

S:mov [x], 1

where to put this store?



Rules:

S(tores) followed by a L(oad) do not have to follow program order.

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Another test
Can t0 == 0?

### Thread 0:

S:mov [x], 1

L:mov %t0, [x]

where to put this store?



Rules:

S(tores) followed by a L(oad) do not have to follow program order.

S(tores) cannot be reordered past a fence in program order

S(tores) cannot be reordered past L(oads) from the same address

### TSO - Total Store Order

#### **Rules:**

S(tores) followed by a L(oad) do not have to follow program order.

S(tores) cannot be reordered past a fence in program order

S(tores) cannot be reordered past L(oads) from the same address

### Schedule

- Memory consistency models:
  - Total store order
  - Relaxed memory consistency

We can specify them in terms of what reorderings are allowed



We can specify them in terms of what reorderings are allowed

|                 |   | memory access 0 |    |
|-----------------|---|-----------------|----|
|                 |   | L               | S  |
| memory access 1 | L | NO              | NO |
|                 | S | NO              | NO |

#### **Sequential Consistency**

We can specify them in terms of what reorderings are allowed

memory access 0

L
S

NO
Different address

memory access 1

NO
NO
NO

TSO - total store order

We can specify them in terms of what reorderings are allowed

|                 |   | memory access 0 |   |
|-----------------|---|-----------------|---|
|                 |   | L               | S |
| memory access 1 | L | ?               | Ş |
|                 | S | ?               | ? |

#### Weaker models?

We can specify them in terms of what reorderings are allowed



Allows stores to drain from the store buffer in any order

We can specify them in terms of what reorderings are allowed



#### **RMO - Relaxed Memory Order**

If memory access 0 appears before memory access 1 in program order, can it bypass program order?

Very relaxed model!

• FENCE: can always restore order using fences. Accesses cannot be reordered past fences!



#### **Any Memory Model**

If memory access 0 appears before memory access 1 in program order, and there is a FENCE between the two accesses, can it bypass program order?

### Schedule

- Memory consistency models:
  - Total store order
  - Relaxed memory consistency
- Some Examples

```
int x[1] = \{0\};
int y[1] = \{0\};
```

First thing: change our syntax to pseudo code

#### Thread 0:

```
L:mov %t0, [y]
```

S:mov [x], 1

### Thread 1:

```
L:mov %t1, [x]
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```

First thing: change our syntax to pseudo code You should be able to find natural mappings to any ISA

#### Thread 0:

L:%t0 = load(y)

S:store (x, 1)

#### *Thread 1:*

L:%t1 = load(x)

S:store(y,1)

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

#### Thread 0:

```
L:%t0 = load(y)
```

S:store(x, 1)

### Thread 1:

L:%t1 = load(x)

S:store(y,1)

```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Question: can t0 == t1 == 1?
```

Get out our lego bricks and try for sequential consistency

#### Thread 0:

```
L:%t0 = load(y)
```

S:store (x, 1)

$$L:%t0 = load(y)$$

S:store (x, 1)

#### Thread 1:

$$L: %t1 = load(x)$$

S:store(y,1)

$$L:%t1 = load(x)$$

S:store(y, 1)

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

```
L:%t0 = load(y)
```

S:store (x, 1)





satisfy constraints

#### Thread 1:

L: %t1 = load(x)

S:store(y,1)

Not allowed under sequential consistency!

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

$$L:%t0 = load(y)$$

S:store (x, 1)

L:%t0 = load(y)

respect program order

S:store(x, 1)

L:%t1 = load(x)

S:store(y, 1)

satisfy constraints

memory access 1

#### *Thread 1:*

L:%t1 = load(x)

S:store(y,1)

memory access 0

L

S

| NO | Different<br>address |
|----|----------------------|
| NO | NO                   |

What about TSO?

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

$$L:%t0 = load(y)$$

S:store (x, 1)

L:%t0 = load(y)

respect program order

S:store(x, 1)

L:%t1 = load(x)

S:store(y,1)

satisfy constraints

memory access 1

Thread 1:

L:%t1 = load(x)

S:store(y,1)

S

memory access 0

.

NO Different address

NO NO

What about TSO? NOT ALLOWED!

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

$$L:%t0 = load(y)$$

S:store (x, 1)

L:%t0 = load(y)

respect program order

S:store (x, 1)

L: %t1 = load(x)

S:store(y,1)

satisfy constraints

memory access 1

#### Thread 1:

L: %t1 = load(x)

S:store(y,1)

memory access 0

L

NO Different address

NO Different address

What about PSO?

ς

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

$$L:%t0 = load(y)$$

S:store (x, 1)

L:%t0 = load(y)

respect program order



L:%t1 = load(x)

S:store (y, 1)

satisfy constraints

memory access 1

#### Thread 1:

L: %t1 = load(x)

S:store(y,1)

memory access 0

L

NO Different address

NO Different address

What about PSO? NO!

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

$$L:%t0 = load(y)$$

S:store (x, 1)

L:%t0 = load(y)

respect program order

S:store(x,1)

L: %t1 = load(x)

S:store(y, 1)

satisfy constraints

memory access 1

#### Thread 1:

L: %t1 = load(x)

S:store(y,1)

memory access 0

5

| YES       | Different<br>address |  |
|-----------|----------------------|--|
| different | Different            |  |
| address   | address              |  |

What about RMO?

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

$$L:%t0 = load(y)$$

S:store (x, 1)

L:%t0 = load(y)

respect program order

S:store (x, 1)

L: %t1 = load(x)

S:store(y,1)

satisfy constraints

memory access 1

*Thread 1:* 

L: %t1 = load(x)

S:store(y,1)

memory access 0

L

S

YES Different address

different Different address

What about RMO?

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

$$L:%t0 = load(y)$$

S:store (x, 1)

L:%t0 = load(y)

respect program order

S:store(x,1)

L:%t1 = load(x)

S:store(y,1)

satisfy constraints

memory access 1

Thread 1:

L: %t1 = load(x)

S:store(y,1)

memory access 0

L

S

YES Different address

different Different address

What about RMO? YES!

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

$$L:%t0 = load(y)$$

S:store (x, 1)

L:%t0 = load(y)

respect program order

S:store(x, 1)

L:%t1 = load(x)

S:store(y,1)

satisfy constraints

memory access 1

How do we disallow the behavior in RMO?

#### Thread 1:

L: %t1 = load(x)

S:store(y,1)

memory access 0

L

S

YES Different address

different Different address

S

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

```
L:%t0 = load(y)
fence
S:store (x, 1)
```

L: %t0 = load(y)



satisfy constraints

#### Thread 1:

memory access 1

L: %t1 = load(x)

S:store(y,1)

memory access 0

Different address

different address

YES

Different address

How do we disallow the behavior in RMO?

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks



memory access 1

different

address

Different address

How do we disallow the behavior in RMO?

satisfy constraints

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks



Now we cannot break program order past the fence! Are we done?

satisfy constraints



memory access 1

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks



```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks



Are we done? The behavior is no longer allowed

# One more example

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == 1 and t1 == 0?

#### Thread 0:

S:store (x, 1)S:store (y, 1)

S:store (x, 1)

S:store(y, 1)

#### Thread 1:

L:%t0 = load(y)

S:%t1 = load(x)

L:%t0 = load(y)

L:%t1 = load(x)

```
int x[1] = \{0\};
int y[1] = \{0\};
```

#### Thread 0:

```
S:store (x, 1)
S:store (y, 1)
```

S:store (x, 1)

S:store(y, 1)

Question: can t0 == 1 and t1 == 0?

start off thinking about sequential consistency

#### Thread 1:

L:%t0 = load(y)

S:%t1 = load(x)

L:%t0 = load(y)

L:%t1 = load(x)

```
Global variable:
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Question: can t0 == 1 and t1 == 0?
```

start off thinking about sequential consistency

### Thread 0:

S:store (x, 1)S:store (y, 1)

S:store (x, 1)

respect program order

S:store(y,1)

L:%t0 = load(y)

L:%t1 = load(x)

satisfy constraints

#### Thread 1:

L:%t0 = load(y)

S: %t1 = load(x)

```
Global variable:
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == 1 and t1 == 0?



```
Global variable:
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == 1 and t1 == 0?



```
Global variable:
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == 1 and t1 == 0?



What about PSO?

```
Global variable:
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```



```
Global variable:
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```



What about PSO? YES

```
Global variable:
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```



Now it is disallowed in PSO

```
Global variable:
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```



What about RMO?

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == 1 and t1 == 0?

#### Thread 0:

```
S:store (x, 1)
```

fence

S:store(y, 1)



```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Question: can t0 == 1 and t1 == 0?
```

#### Thread 0:

```
S:store(x,1)
fence
S:store(y,1)
```



What about RMO? The loads can be reordered also!

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == 1 and t1 == 0?

#### Thread 0:

```
S:store (x, 1)
fence
```

S:store(y, 1)

L: %t1 = load(x)



Different

address

Different

address

YES

Different

address

What about RMO? add a fence

```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Question: can t0 == 1 and t1 == 0?
```

#### Thread 0:

```
S:store(x,1)
fence
S:store(y,1)
```



Now the relaxed behavior is disallowed

#### • Historic Chips:

- X86: TSO
  - Surprisingly robust
  - Mutexes and concurrent data structures generally seem to work
  - Watch out for store buffering
- IBM Power and ARM
  - Very relaxed. Similar to RMO with even more rules
  - Mutexes and data structures must be written with care
  - ARM recently strengthened theirs

#### Historic Chips:

- X86: TSO
  - Surprisingly robust
  - Mutexes and concurrent data structures generally seem to work
  - Watch out for store buffering
- IBM Power and ARM
  - Very relaxed. Similar to RMO with even more rules
  - Mutexes and data structures must be written with care
  - ARM recently strengthened theirs

Companies have a history of providing insufficient documentation about their rules: academics have then gone and figured it out!

Getting better these days

- Modern Chips:
  - RISC-V: two specs: one similar to TSO, one similar to RMO
  - Apple M1: toggles between TSO and weaker

- PSO and RMO were never implemented widely
  - I have not met anyone who knows of any RMO taped out chip
  - They are part of SPARC ISAs (i.e. RISC-V before it was cool)
  - These memory models might have been part of specialized chips
- Interestingly:
  - Early Nvidia GPUs appeared to informally implement RMO
- Other chips have very strange memory models:
  - Alpha DEC basically no rules

# Where do programming languages fit in?

- One of the highest priorities of a programming language
  - Write once, run everywhere

start with both of the grids for the two different memory models

language C++11 (sequential consistency)

\_\_\_\_

. NO NO

S NO NO

target machine

. S

, ;

S ?

start with both of the grids for the two different memory models

language C++11 (sequential consistency)

NO NO

S NO NO

target machine TSO (x86)

\_ S

L NO different address

S NO No

start with both of the grids for the two different memory models

language C++11 (sequential consistency)

L !

L NO NO

find mismatch

target machine TSO (x86)

\_ S

NO different address

S

start with both of the grids for the two different memory models

language C++11 (sequential consistency)

L

S

L

S

| _  |    |
|----|----|
| NO | NO |
| NO | NO |

find mismatch

Two options:

make sure stores are not reordered with later loads

make sure loads are not reordered with earlier stores target machine TSO (x86)

. S

NO different address

S

start with both of the grids for the two different memory models



start with both of the grids for the two different memory models



start with both of the grids for the two different memory models





This should help you see why you want to reduce the number of atomic load/stores in your program

start with both of the grids for the two different memory models

language C++11 (sequential consistency)

1

S

| NO | NO |
|----|----|
| NO | NO |

How about this one?

target machine PSO

S

NO different address

NO different address

S

start with both of the grids for the two different memory models

language C++11 (sequential consistency)

L !

NO NO

S NO NO

target machine PSO

\_ S

NO different address

NO different address

S

start with both of the grids for the two different memory models



### Memory orders

- Atomic operations take an additional "memory order" argument
  - memory order seq cst -default
  - memory order relaxed weakest

## Relaxed memory order

language C++11 (sequential consistency)

ı

S

| NO | NO |
|----|----|
| NO | NO |

language C++11 (memory\_order\_relaxed)

. S

different different address different different address address

basically no orderings except for accesses to the same address

language C++11 (memory\_order\_relaxed)

S

address

different different address address different different S

address

target machine TSO (x86)

S

different NO address NO No

S

language C++11 (memory\_order\_relaxed)

L S

L different address different address different address

lots of mismatches!

target machine TSO (x86)

S

NO different address

S



L S

different address address

S different address different address

lots of mismatches!

But language is more relaxed than machine

so no fences are needed

S

target machine TSO (x86)

. S

| NO | different<br>address |
|----|----------------------|
| NO | No                   |

Do any of the ISA memory models need any fences for relaxed memory order?

#### language C++11 (memory\_order\_relaxed) S S S S Different Different Different NO YES NO different different address address address address address Different Different Different S S NO NO NO different different address S address address address address **TSO PSO RMO**

# Memory order relaxed

- Very few use-cases! Be very careful when using it
  - Peeking at values (later accessed using a heavier memory order)
  - Counting (e.g. number of finished threads in work stealing)
  - DO NOT USE FOR QUEUE INDEXES

### More memory orders: we will not discuss in class

- Atomic operations take an additional "memory order" argument
  - memory order seq cst -default
  - memory order relaxed weakest
- More memory orders (useful for mutex implementations):
  - memory order acquire
  - memory order release
- EVEN MORE memory orders (complicated: in most research it is omitted)
  - memory order consume

# A cautionary tale

```
Thread 0:
m.lock();
display.enq(triangle0);
m.unlock();
```

```
Thread 1:
    m.lock();
    display.enq(triangle1);
    m.unlock();
```

```
Thread 0:
m.lock();
display.enq(triangle0);
m.unlock();
```

```
Thread 1:
m.lock();
display.enq(triangle1);
m.unlock();
```

We know how lock and unlock are implemented

```
Thread 0:
SPIN:CAS(mutex, 0, 1);
display.enq(triangle0);
store(mutex, 0);
```

```
Thread 1:
SPIN:CAS (mutex, 0, 1);
display.enq(triangle1);
store(mutex, 0);
```

We know how lock and unlock are implemented We also know how a queue is implemented

```
Thread 0:
SPIN:CAS(mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex, 0);
```

```
Thread 1:
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex,0);
```

We know how lock and unlock are implemented We also know how a queue is implemented

What is an execution?

```
Thread 0:
SPIN:CAS(mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex, 0);
```

```
Thread 1:
SPIN:CAS(mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex, 0);
```

CAS (mutex, 0, 1);

if blue goes first
it gets to complete
its critical section
while thread 1 is spinning

# Thread 0: SPIN:CAS (mutex, 0, 1); %i = load (head); store (buffer+i, triangle0); store (head, %i+1); store (mutex, 0);

```
Thread 1:
SPIN:CAS(mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex, 0);
```

```
CAS (mutex, 0, 1);
%i = load (head);
store (buffer+i, triangle0);
store (head, %i+1);
store (mutex, 0);
```

# Thread 0: SPIN:CAS (mutex, 0, 1); %i = load (head); store (buffer+i, triangle0); store (head, %i+1); store (mutex, 0);

```
Thread 1:
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex,0);
```

```
CAS (mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex, 0);
```

now yellow gets a change to go

# Thread 0: SPIN:CAS (mutex, 0, 1); %i = load (head); store (buffer+i, triangle0); store (head, %i+1); store (mutex, 0);

```
Thread 1:
SPIN:CAS(mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex, 0);
```

now yellow gets a change to go

```
CAS (mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex, 0);
CAS (mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store (mutex, 0);
```

```
Thread 0:
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex,0);
```

```
Thread 1:
SPIN:CAS(mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex, 0);
```

NO Different address

NO Different address

```
CAS (mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex, 0);
CAS (mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex, 0);
```

```
Thread 0:
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex,0);
```

```
Thread 1:
SPIN:CAS(mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex, 0);
```

L S

NO Different address

NO Different address

```
CAS (mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex, 0);
CAS (mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex, 0);
```

```
Thread 0:
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex,0);
```

```
Thread 1:
SPIN:CAS(mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex, 0);
```

NO Different address

NO Different address



What just happened if this store moves?

### Nvidia in 2015

Nvidia architects implemented a weak memory model

Nvidia programmers expected a strong memory model

Mutexes implemented without fences!

## Nvidia in 2015













bug found in two Nvidia textbooks

We implemented a side-channel attack that made the bugs appear more frequently

These days Nvidia has a very well-specified memory model!

```
Thread 0:
SPIN:CAS (mutex, 0, 1);
%i = load (head);
store (buffer+i, triangle0);
store (head, %i+1);
store (mutex, 0);
```

```
Thread 1:
SPIN:CAS(mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex, 0);
```

NO Different address

NO Different address

```
CAS (mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex, 0);
CAS (mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex, 0);
```

How to fix the issue?

```
Thread 1:
SPIN:CAS (mutex, 0, 1);
%i = load (head);
store (buffer+i, triangle1);
store (head, %i+1);
fence;
unlock contains fence
before store!
```

L S

NO Different address

NO Different address

```
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store (mutex, 0);
CAS (mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store (mutex, 0);
```

CAS (mutex, 0, 1);

How to fix the issue?

your unlock function should contain a fence!

```
Thread 1:
SPIN:CAS (mutex, 0, 1);
%i = load (head);
store (buffer+i, triangle1);
store (head, %i+1);
fence;
unlock contains fence
before store!
```

L S

NO Different address

NO Different address



How to fix the issue?

your unlock function should contain a fence!

## Memory Model Strength

**TSO** 

• If one memory model M0 allows more relaxed behaviors than another memory model M1, then M0 is more *relaxed* (or *weaker*) than M1.

• It is safe to run a program written for M0 on M1. But not vice versa

|   | L  | S                    |   | L  | S                    |   | L                    | S                    |
|---|----|----------------------|---|----|----------------------|---|----------------------|----------------------|
| L | NO | Different<br>address | L | NO | Different<br>address | L | YES                  | Different<br>address |
| S | NO | NO                   | S | NO | Different<br>address | S | Different<br>address | Different<br>address |

**PSO** 

RMO

## Memory Model Strength

**TSO** 

- Many times specifications are weaker than implementations:
  - A chip might document PSO, but implement TSO:

|   | L  | S                    |   | L  | S                    |   | L                    | S                    |
|---|----|----------------------|---|----|----------------------|---|----------------------|----------------------|
| L | NO | Different<br>address | L | NO | Different<br>address | L | YES                  | Different<br>address |
| S | NO | NO                   | S | NO | Different<br>address | S | Different<br>address | Different<br>address |

**PSO** 

**RMO**