# CSE113: Parallel Programming

mov [y], 1

- Topics:
  - Finish up weak memory models



# Announcements

HW 3 was graded last week.

- Today is the last day to turn in HW 4 (and 3 more days)
  - Hopefully you had fun with a taste of GPU programming
- HW 5 will be released this week
  - Last day to turn it in is Dec 3
  - It could be useful to work on for the final

# Announcements

# **Final**

- Allowed 3 pages of notes, front and back
- Similar to the midterm
- Time: Dec 12, 12-3pm

# Announcements

SETs are out, please do them! It helps us out a lot.

# Previous quiz + Review

# Memory models

```
int x[1] = \{0\};
int y[1] = \{0\};
```

First thing: change our syntax to pseudo code

### Thread 0:

```
L:mov %t0, [y]
```

S:mov [x], 1

# Thread 1:

```
L:mov %t1, [x]
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```

First thing: change our syntax to pseudo code You should be able to find natural mappings to any ISA

#### Thread 0:

L:%t0 = load(y)

S:store (x, 1)

# *Thread 1:*

L:%t1 = load(x)

S:store(y,1)

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

### Thread 0:

```
L:%t0 = load(y)
```

S:store(x,1)

# Thread 1:

L:%t1 = load(x)

S:store(y, 1)

```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Question: can t0 == t1 == 1?
```

Get out our lego bricks and try for sequential consistency

# Thread 0:

```
L:%t0 = load(y)
```

S:store (x, 1)



### Thread 1:

L: %t1 = load(x)

S:store(y,1)

L:%t1 = load(x)

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks

# Thread 0:



respect program order



satisfy constraints

#### Thread 1:

L: %t1 = load(x)

S:store(y, 1)

Not allowed under sequential consistency!

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

$$L: %t0 = load(y)$$

S:store(x, 1)

L:%t0 = load(y)

respect program order

S:store(x,1)

L:%t1 = load(x)

S:store(y, 1)

satisfy constraints

memory access 1

#### *Thread 1:*

L:%t1 = load(x)

S:store(y,1)

memory access 0

L

S

| NO | Different<br>address |
|----|----------------------|
| NO | NO                   |

What about TSO?

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

$$L:%t0 = load(y)$$

S:store(x, 1)

L:%t0 = load(y)

respect program order

S:store(x, 1)

L:%t1 = load(x)

S:store(y,1)

satisfy constraints

memory access 1

Thread 1:

L:%t1 = load(x)

S:store(y,1)

S

memory access 0

.

NO Different address

NO NO

What about TSO? NOT ALLOWED!

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

$$L:%t0 = load(y)$$

S:store(x, 1)

L:%t0 = load(y)

respect program order

S:store (x, 1)

L: %t1 = load(x)

S:store(y,1)

satisfy constraints

memory access 1

# Thread 1:

L: %t1 = load(x)

S:store(y,1)

memory access 0

L

NO Different address

NO Different address

What about PSO?

ς

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

$$L:%t0 = load(y)$$

S:store (x, 1)

L:%t0 = load(y)

respect program order



L:%t1 = load(x)

S:store (y, 1)

satisfy constraints

memory access 1

#### Thread 1:

L: %t1 = load(x)

S:store(y,1)

memory access 0

L

NO Different address

NO Different address

What about PSO? NO!

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

$$L:%t0 = load(y)$$

S:store(x, 1)

L:%t0 = load(y)

respect program order

S:store(x,1)

L: %t1 = load(x)

S:store(y, 1)

satisfy constraints

memory access 1

# Thread 1:

L: %t1 = load(x)

S:store(y,1)

memory access 0

5

| YES       | Different<br>address |
|-----------|----------------------|
| different | Different            |
| address   | address              |

What about RMO?

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

$$L:%t0 = load(y)$$

S:store (x, 1)

L:%t0 = load(y)

respect program order

S:store (x, 1)

L: %t1 = load(x)

S:store(y,1)

satisfy constraints

memory access 1

*Thread 1:* 

L: %t1 = load(x)

S:store(y,1)

memory access 0

L

S

YES Different address

different Different address

What about RMO?

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

$$L:%t0 = load(y)$$

S:store (x, 1)

L:%t0 = load(y)

respect program order

S:store(x,1)

L:%t1 = load(x)

S:store(y,1)

satisfy constraints

memory access 1

Thread 1:

L: %t1 = load(x)

S:store(y,1)

memory access 0

L

S

YES Different address

different Different address

What about RMO? YES!

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks

#### Thread 0:

$$L:%t0 = load(y)$$

S:store(x, 1)

L:%t0 = load(y)

respect program order

S:store(x, 1)

L:%t1 = load(x)

S:store(y,1)

satisfy constraints

memory access 1

How do we disallow the behavior in RMO?

#### Thread 1:

L: %t1 = load(x)

S:store(y,1)

memory access 0

L

S

YES Different address

different Different address

S

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks

# Thread 0:

$$L:%t0 = load(y)$$



# Thread 1:

L: %t1 = load(x)

S:store(y,1)

memory access 0

YES

Different

address

memory access 1

S

different Different address

How do we disallow the behavior in RMO?

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == t1 == 1?

Get out our lego bricks

respect program order

#### Thread 0:

S:store(x, 1)

L:%t0 = load(y)

# Thread 1:

L: %t1 = load(x)

S:store(y,1)

memory access 1

S:store(
$$x, 1$$
)

fence

L:%t1 = load(x)

S:store(y, 1)

How do we disallow the behavior in RMO?

memory access 0



```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Question: can t0 == t1 == 1?
```

Get out our lego bricks

#### Thread 0:

```
L:%t0 = load(y)
```

fence

S:store (x, 1)

Now we cannot break program order past the fence! Are we done?

# Thread 1:

L: %t1 = load(x)

S:store(y,1)

memory access 0



```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Question: can t0 == t1 == 1?
```

Get out our lego bricks

#### Thread 0:

```
L:%t0 = load(y)
```

fence

S:store (x, 1)



Now we cannot break program order past the fence! Are we done?

```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Question: can t0 == t1 == 1?
```

Get out our lego bricks

# Thread 0:

```
L:%t0 = load(y)
fence
S:store(x,1)
```



Now we cannot break program order past the fence! Are we done? The behavior is no longer allowed

# Another example

```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Question: can t0 == 1 and t1 == 0?
```

#### Thread 0:

```
S:store (x, 1)
S:store (y, 1)
```



#### Thread 1:

L:%t0 = load(y)

L: %t1 = load(x)

Sequential consistency: This interleaving does not allow it.

```
int x[1] = \{0\};
int y[1] = \{0\};
```

#### Thread 0:

```
S:store (x, 1)
S:store (y, 1)
```

S:store (x, 1)

S:store(y, 1)

Question: can t0 == 1 and t1 == 0?

start off thinking about sequential consistency

# Thread 1:

L:%t0 = load(y)

S:%t1 = load(x)

L:%t0 = load(y)

L:%t1 = load(x)

```
Global variable:
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Question: can t0 == 1 and t1 == 0?
```

start off thinking about sequential consistency

# Thread 0:

S:store (x, 1)S:store (y, 1)

S:store (x, 1)

respect program order

S:store(y,1)

L:%t0 = load(y)

L:%t1 = load(x)

satisfy constraints

#### Thread 1:

L:%t0 = load(y)

S: %t1 = load(x)

```
Global variable:
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```



```
Global variable:
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```



```
Global variable:
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```



What about PSO?

```
Global variable:
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```



```
Global variable:
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```



What about PSO? YES

```
Global variable:
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```



Now it is disallowed in PSO

```
Global variable:
```

```
int x[1] = \{0\};
int y[1] = \{0\};
```



```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == 1 and t1 == 0?

# Thread 0:

```
S:store (x, 1)
```

fence

S:store(y, 1)



#### Global variable:

```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Question: can t0 == 1 and t1 == 0?
```

#### Thread 0:

```
S:store(x,1)
fence
S:store(y,1)
```



What about RMO? The loads can be reordered also!

#### Global variable:

```
int x[1] = \{0\};
int y[1] = \{0\};
```

Question: can t0 == 1 and t1 == 0?

#### Thread 0:

```
S:store (x, 1)
fence
```

S:store(y, 1)

L: %t1 = load(x)



Different

address

Different

address

YES

Different

address

What about RMO? add a fence

#### Global variable:

```
int x[1] = \{0\};
int y[1] = \{0\};
```

```
Question: can t0 == 1 and t1 == 0?
```

#### Thread 0:

```
S:store(x,1)
fence
S:store(y,1)
```



Now the relaxed behavior is disallowed

#### • Historic Chips:

- X86: TSO
  - Surprising robust
  - Mutexes and concurrent data structures generally seem to work
  - Watch out for store buffering
- IBM Power and ARM
  - Very relaxed. Similar to RMO with even more rules
  - Mutexes and data structures must be written with care
  - ARM recently strengthened theirs

#### • Historic Chips:

- X86: TSO
  - Surprising robust
  - Mutexes and concurrent data structures generally seem to work
  - Watch out for store buffering
- IBM Power and ARM
  - Very relaxed. Similar to RMO with even more rules
  - Mutexes and data structures must be written with care
  - ARM recently strengthened theirs

Companies have a history of providing insufficient documentation about their rules: academics have then gone and figured it out!

Getting better these days

- Modern Chips:
  - RISC-V: two specs: one similar to TSO, one similar to RMO
  - Apple M1: toggles between TSO and weaker

- PSO and RMO were never implemented widely
  - I have not met anyone who knows of any RMO taped out chip
  - They are part of SPARC ISAs (i.e. RISC-V before it was cool)
  - These memory models might have been part of specialized chips
- Interestingly:
  - Early Nvidia GPUs appeared to informally implement RMO
- Other chips have very strange memory models:
  - Alpha DEC basically no rules

## Where do programming languages fit in?

- One of the highest priorities of a programming language
  - Write once, run everywhere

start with both of the grids for the two different memory models

language C++11 (sequential consistency)

NO NO

S NO NO

target machine

. S

, ,

S ?

start with both of the grids for the two different memory models

language C++11 (sequential consistency)

\_\_\_\_\_

NO NO

S NO NO

target machine TSO (x86)

S

NO different address

start with both of the grids for the two different memory models

language C++11 (sequential consistency)

L !

L NO NO

find mismatch

target machine TSO (x86)

\_ S

NO different address

start with both of the grids for the two different memory models

language C++11 (sequential consistency)

L

S

L

S

| _  |    |
|----|----|
| NO | NO |
| NO | NO |

find mismatch

Two options:

make sure stores are not reordered with later loads

make sure loads are not reordered with earlier stores target machine TSO (x86)

. S

NO different address

start with both of the grids for the two different memory models



start with both of the grids for the two different memory models



start with both of the grids for the two different memory models





This should help you see why you want to reduce the number of atomic load/stores in your program

start with both of the grids for the two different memory models

language C++11 (sequential consistency)

1

S

| NO | NO |
|----|----|
| NO | NO |

How about this one?

target machine PSO

S

NO different address

NO different address

start with both of the grids for the two different memory models

language C++11 (sequential consistency)

L !

NO NO

S NO NO

target machine PSO

\_ S

NO different address

NO different address

start with both of the grids for the two different memory models



#### Memory orders

- Atomic operations take an additional "memory order" argument
  - memory order seq cst -default
  - memory order relaxed weakest

#### Relaxed memory order

language C++11 (sequential consistency)

ı

S

| NO | NO |
|----|----|
| NO | NO |

language C++11 (memory\_order\_relaxed)

. S

different different address different different address address

basically no orderings except for accesses to the same address

language C++11 (memory\_order\_relaxed)

S

address

different different address address different different S

address

target machine TSO (x86)

S

different NO address NO No

language C++11 (memory\_order\_relaxed)

L S

L different address different address different address

lots of mismatches!

target machine TSO (x86)

S

NO different address



L S

different address address

S different address different address

lots of mismatches!

But language is more relaxed than machine

so no fences are needed

S

target machine TSO (x86)

. S

| NO | different<br>address |
|----|----------------------|
| NO | No                   |

Do any of the ISA memory models need any fences for relaxed memory order?

#### language C++11 (memory\_order\_relaxed) S S S S Different Different Different NO YES NO different different address address address address address Different Different Different S S NO NO NO different different address S address address address address **TSO PSO RMO**

## Memory order relaxed

- Very few use-cases! Be very careful when using it
  - Peeking at values (later accessed using a heavier memory order)
  - Counting (e.g. number of finished threads in work stealing)

#### More memory orders

- Atomic operations take an additional "memory order" argument
  - memory order seq cst -default
  - memory order relaxed weakest
- More memory orders (useful for mutex implementations):
  - memory order acquire
  - memory order release
- EVEN MORE memory orders (complicated: in most research it is omitted)
  - memory order consume

## More memory orders

```
int x;
atomic<boolean> flag;
x = 17;
flag.store(true,
           memory_model_realase)
                                        if (flag.load(memory_model_acquire))
                                           assert(x == 17);
```

## A cautionary tale

```
Thread 0:
m.lock();
display.enq(triangle0);
m.unlock();
```

```
Thread 1:
    m.lock();
    display.enq(triangle1);
    m.unlock();
```

```
Thread 0:
m.lock();
display.enq(triangle0);
m.unlock();
```

```
Thread 1:
m.lock();
display.enq(triangle1);
m.unlock();
```

We know how lock and unlock are implemented

```
Thread 0:
SPIN:CAS(mutex, 0, 1);
display.enq(triangle0);
store(mutex, 0);
```

```
Thread 1:
SPIN:CAS (mutex, 0, 1);
display.enq(triangle1);
store(mutex, 0);
```

We know how lock and unlock are implemented We also know how a queue is implemented

```
Thread 0:
SPIN:CAS(mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex, 0);
```

```
Thread 1:
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex,0);
```

We know how lock and unlock are implemented We also know how a queue is implemented

What is an execution?

```
Thread 0:
SPIN:CAS(mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex, 0);
```

```
Thread 1:
SPIN:CAS(mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex, 0);
```

CAS (mutex, 0, 1);

if blue goes first
it gets to complete
its critical section
while thread 1 is spinning

# Thread 0: SPIN:CAS (mutex, 0, 1); %i = load (head); store (buffer+i, triangle0); store (head, %i+1); store (mutex, 0);

```
Thread 1:
SPIN:CAS(mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex, 0);
```

```
CAS (mutex, 0, 1);
%i = load (head);
store (buffer+i, triangle0);
store (head, %i+1);
store (mutex, 0);
```

# Thread 0: SPIN:CAS (mutex, 0, 1); %i = load (head); store (buffer+i, triangle0); store (head, %i+1); store (mutex, 0);

```
Thread 1:
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex,0);
```

```
CAS (mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex, 0);
```

now yellow gets a change to go

# Thread 0: SPIN:CAS (mutex, 0, 1); %i = load (head); store (buffer+i, triangle0); store (head, %i+1); store (mutex, 0);

```
Thread 1:
SPIN:CAS(mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex, 0);
```

now yellow gets a change to go

```
CAS (mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex, 0);
CAS (mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store (mutex, 0);
```

```
Thread 0:
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex,0);
```

```
Thread 1:
SPIN:CAS(mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex, 0);
```

NO Different address

NO Different address

```
CAS (mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex, 0);
CAS (mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex, 0);
```

```
Thread 0:
SPIN:CAS (mutex, 0, 1);
%i = load (head);
store (buffer+i, triangle0);
store (head, %i+1);
store (mutex, 0);
```

```
Thread 1:
SPIN:CAS(mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex, 0);
```

L S

NO Different address

NO Different address

```
CAS (mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex, 0);
CAS (mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex, 0);
```

```
Thread 0:
SPIN:CAS(mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex, 0);
```

```
Thread 1:
SPIN:CAS(mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex, 0);
```

NO Different address

NO Different address



What just happened if this store moves?

#### Nvidia in 2015

Nvidia architects implemented a weak memory model

Nvidia programmers expected a strong memory model

Mutexes implemented without fences!

## Nvidia in 2015













```
Thread 0:
SPIN:CAS(mutex,0,1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex,0);
```

```
Thread 1:
SPIN:CAS(mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle1);
store(head, %i+1);
store(mutex, 0);
```

NO Different address

NO Different address

```
CAS (mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex, 0);
CAS (mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store(mutex, 0);
```

How to fix the issue?

```
Thread 1:
SPIN:CAS (mutex, 0, 1);
%i = load (head);
store (buffer+i, triangle1);
store (head, %i+1);
fence;
unlock contains fence
before store!
```

L S

NO Different address

NO Different address

```
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store (mutex, 0);
CAS (mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
store (mutex, 0);
```

CAS (mutex, 0, 1);

How to fix the issue?

your unlock function should contain a fence!

```
Thread 0:
SPIN:CAS (mutex, 0, 1);
%i = load(head);
store(buffer+i, triangle0);
store(head, %i+1);
fence;
unlock contains fence
store(mutex, 0);
before store!
```

```
Thread 1:
SPIN:CAS (mutex, 0, 1);
%i = load (head);
store (buffer+i, triangle1);
store (head, %i+1);
fence;
unlock contains fence
before store!
```

L S

NO Different address

NO Different address



How to fix the issue?

your unlock function should contain a fence!

### Memory Model Strength

**TSO** 

• If one memory model M0 allows more relaxed behaviors than another memory model M1, then M0 is more *relaxed* (or *weaker*) than M1.

• It is safe to run a program written for M0 on M1. But not vice versa

|   | L  | S                    |   | L  | S                    |   | L                    | S                    |
|---|----|----------------------|---|----|----------------------|---|----------------------|----------------------|
| L | NO | Different<br>address | L | NO | Different<br>address | L | YES                  | Different<br>address |
| S | NO | NO                   | S | NO | Different<br>address | S | Different<br>address | Different<br>address |

**PSO** 

RMO

### Memory Model Strength

- Many times specifications are weaker than implementations:
  - A chip might document PSO, but implement TSO

|   | L  | S                    |   | L  | S                    |   | L                    | S                    |
|---|----|----------------------|---|----|----------------------|---|----------------------|----------------------|
| L | NO | Different<br>address | L | NO | Different<br>address | L | YES                  | Different<br>address |
| S | NO | NO                   | S | NO | Different<br>address | S | Different<br>address | Different<br>address |

TSO PSO