# Multiprocessors and Caching

CS 1541 Wonsun Ahn



#### Two ways to use multiple processors

- Distributed (Memory) System
  - Processors do not share memory (and by extension data)
  - o Processors exchange data through network messages
  - Programming standards:
    - Message Passing Interface (MPI) C/C++ API for exchanging messages
    - Ajax (Asynchronous JavaScript and XML) API for web apps
  - Data exchange protocols: TCP/IP, UDP/IP, JSON, XML...
- Shared Memory System (a.k.a. Multiprocessor System)
  - Processors share memory (and by extension data)
  - Programming standards:

University of

- Pthreads (POSIX threads), Java threads APIs for threading
- OpenMP Compiler #pragma directives for parallelization
- o Cache coherence protocol: protocol for exchanging data among caches
- → Just like Ethernet, caches are part of a larger network of caches

#### Shared Data Review

- What bad thing can happen when you have shared data?
- Dataraces!
  - You should have learned it in CS 449.
  - But if you didn't, don't worry I'll go over it.



```
int shared = 0;
void *add(void *unused) {
  for (int i=0; i < 1000000; i++) { shared++; }
  return NULL;
int main() {
 pthread t t;
  // Child thread starts running add
 pthread create (&t, NULL, add, NULL);
  // Main thread starts running add
  add (NULL);
  // Wait until child thread completes
  pthread join(t, NULL);
  printf("shared=%d\n", shared);
  return 0;
```

```
bash-4.2$ ./datarace
shared=1085894
bash-4.2$ ./datarace
shared=1101173
bash-4.2$ ./datarace
shared=1065494
```

- Q) What do you expect from running this? Maybe shared=2000000?
- A) Nondeterministic result! Due to datarace on shared.































- Why did this occur in the first place?
- Because data was replicated to CPU registers and each worked on its own copy!
  - When two threads do shared++; initially shared = 1





```
pthread mutex t lock;
int shared = 0;
void *add(void *unused) {
  for (int i=0; i < 1000000; i++) {
      pthread mutex lock(&lock);
      shared++;
      pthread mutex unlock(&lock);
  return NULL;
int main() {
```

```
bash-4.2$ ./datarace
shared=2000000
bash-4.2$ ./datarace
shared=2000000
bash-4.2$ ./datarace
shared=2000000
```

- Data race is fixed! Now shared is always 2000000.
- Problem solved? No! CPU registers is not the only place **replication** happens!



• What happens if caches sit in between processors and memory?



Pittsburgh

• Let's say CPU 0 first fetches shared for incrementing



• Then CPU 0 increments shared 100 times to 100





• Then CPU 2 gets hold of the mutex and fetches shared from L3





• Then CPU 2 increments shared 10 times to 10



• Clearly this is wrong. L1 caches of CPU 0 and CPU 2 are incoherent.





#### Cache Incoherence: Problem with Private Caches

- This problem does not occur with a shared cache.
  - All processors share and work on a single copy of data.



- The problem exists only with private caches.
- The problem exists for **private** caches.
  - Private copy is at times inconsistent with lower memory.
  - o **Incoherence** occurs when private copies differ from each other.
    - → Means processors return different values for same location!



#### Cache Coherence



#### Cache Coherence

- Cache coherence (loosely defined):
  - All processors of system should see the same view of memory
  - o Copies of values cached by processors should adhere to this rule
- Each ISA has a different definition of what that "view" means
  - o Memory consistency model: definition of what that "view" is
- All models agree on one thing:
  - That a change in value should reflect on all copies (eventually)



#### How Memory Consistent Model affects correctness

• Initially, data == 0, flag == false.

#### **Producer Thread**

#### **Consumer Thread**

```
data == 42;
flag = true;

while(flag == false) { /* wait */ }
System.out.println("data=" + data);
```

- Q) What do you expect the value of data will be when it gets printed?
- A) Most people will say 42 because that is the logical ordering.

But is it? Not always. There are situations where data is still 0!



• Initially, data == 0, flag == false.

#### **Producer Thread**

```
data == 42;
flag = true;
```



#### **Consumer Thread**

```
while(flag == false) { /* wait */ }
System.out.println("data=" + data);
```



Let's assume initially both data and flag are cached in each CPU's L1 caches.



• Initially, data == 0, flag == false.

#### **Producer Thread**

```
data == 42;
flag = true;
```



#### **Consumer Thread**

```
while(flag == false) { /* wait */ }
System.out.println("data=" + data);
```



CPU 0 updates both data and flag to 42 and true.



• Initially, data == 0, flag == false.

#### **Producer Thread**

```
data == 42;
flag = true;
```

#### **Consumer Thread**

```
while(flag == false) { /* wait */ }
System.out.println("data=" + data);
```





Now the cached values in CPU 1 are stale and need to be **invalidated**.

**Invalidation**: act of marking a cache block with stale data invalid.



• Initially, data == 0, flag == false.

#### **Producer Thread**

### data == 42; flag = true;

#### **Consumer Thread**

```
while(flag == false) { /* wait */ }
System.out.println("data=" + data);
```



The invalidate messages travel through a network and may arrive out-of-order. Let's say invalidate for flag arrives first to CPU 1 and marks flag invalid.



• Initially, data == 0, flag == false.

#### **Producer Thread Consumer Thread** data == 42; while(flag == false) { /\* wait \*/ } flag = true; System.out.println("data=" + data); **Fetch for flag** L1\$ L1\$ Invalidate for data data data flag flag 42 true 0 true

CPU 1 fetches updated flag from CPU 0 when comparing flag == false. Invalidate for data is still traveling through the network.



• Initially, data == 0, flag == false.

# Producer Thread data == 42; flag = true; CPU 0 Fetch for flag L1\$ Consumer Thread while(flag == false) { /\* wait \*/ } System.out.println("data=" + data);

Since flag is true, CPU 1 breaks out of while loop and prints data. data=0 gets printed!

data

0

flag

true

Invalidate for data



data

42

flag

true

• Initially, data == 0, flag == false.

#### **Producer Thread**

```
data == 42;
flag = true;
```



```
while(flag == false) { /* wait */ }
System.out.println("data=" + data);
```





Let's assume now flag is not cached in CPU 1.

CPU 1 suffers a cache miss on flag when it compares flag == false.



• Initially, data == 0, flag == false.

#### **Producer Thread**

```
data == 42;
flag = true;
```



```
while(flag == false) { /* wait */ }
System.out.println("data=" + data);
```





Instruction Queue

lw r1, flag (miss)

beq r1, \$zero, \_loop

lw r2, data (hit)

call println on r2

Instead of stalling, CPU 1 predicts the branch not taken and issues lw r2, data. Now, r2 == 0. (Unless pipeline flushes due to branch misprediction.)



• Initially, data == 0, flag == false.

#### **Producer Thread**

## data == 42; flag = true;

#### **Consumer Thread**

```
while(flag == false) { /* wait */ }
System.out.println("data=" + data);
```



**Fetch for flag** 



Instruction Queue

lw r1, flag (miss)

beq r1, \$zero, \_loop

lw r2, data (hit)

call println on r2

Now let's say CPU 0 updates data and flag before the fetch for flag arrives.

Now, lw r1, flag completes, allowing beq r1, zero, loop to issue (with r1 == true)



• Initially, data == 0, flag == false.

#### **Producer Thread**

```
data == 42;
flag = true;
```

#### **Consumer Thread**

```
while(flag == false) { /* wait */ }
System.out.println("data=" + data);
```



#### **Fetch for flag**



#### **Instruction Queue**

```
lw r1, flag (miss)
beq r1, $zero, _loop
lw r2, data (hit)
call println on r2
```

Since r1 == true, that validates the not-taken prediction for the branch.

Since r1 == 0, the println outputs data=0!



#### Memory Consistency Models are often very lax

• Initially, data == 0, flag == false.

#### **Producer Thread**

#### **Consumer Thread**

```
data == 42;
flag = true;
while(flag == false) { /* wait */ }
System.out.println("data=" + data);
```

- A memory consistency model where above ordering is guaranteed is called, Sequential Consistency (SC): Instructions appear to execute sequentially.
- Real models allow many other orderings to allow optimizations:
  - Write buffers that allow multiple stores to be pending and perform out-of-order
  - Instruction queues that allow loads and other instructions to perform out-of-order
  - Compiler optimizations to reschedule stores and loads out-of-order
- Intel, ARM, Java Virtual Machine all have relaxed memory consistency models
- Moral: never do custom synchronization unless you know what you are doing!



#### Memory Consistency Models are often very lax

Initially, data == 0, flag == false.

#### **Producer Thread**

#### **Consumer Thread**

```
data == 42:
flag = true;
```

```
while(flag == false) { /* wait */ }
System.out.println("data=" + data);
```

- Regardless of memory consistency model, they all agree on one thing: that values of data and flag must be made coherent eventually.
  - They only disagree on when that eventually is.
- This property is called cache coherence.



#### Implementing Cache Coherence

- How to guarantee changes in value are propagated to all caches?
- Cache coherence protocol: A protocol, or set of rules, that all caches must follow to ensure coherence between caches
  - MSI (Modified-Shared-Invalid)
  - MESI (Modified-Exclusive-Shared-Invalid)
  - o ... often named after the states in cache controller FSM
- Three states of **MSI** protocol (maintained for each block):
  - Modified: Dirty. Only this cache has copy.
  - Shared: Clean. Other caches may have copy.
  - Invalid: Block contains no data.



## MSI Snoopy Cache Coherence Protocol



- Each processor monitors (snoops) the activity on the bus
  - In much the same way as how nodes snoop the Ethernet
- Cache state changes in response to both:
  - Read / writes from the local processor
  - Read misses / write misses from remote processors it snoops





1. CPU 0: line in Modified state, CPU 1~3: line in Invalid state





- 1. CPU 0: line in Modified state, CPU 1~3: line in Invalid state
- 2. CPU 1 reads line, causing a Read Miss





- 1. CPU 0: line in Modified state, CPU 1~3: line in Invalid state
- 2. CPU 1 reads line, causing a Read Miss
- 3. A Read message is broadcast on the Bus





- 1. CPU 0: line in Modified state, CPU 1~3: line in Invalid state
- 2. CPU 1 reads line, causing a Read Miss
- 3. A Read message is broadcast on the Bus
- 4. CPU 0 snoops message and provides line to CPU1
  - Line in CPU 0 and CPU 1 both transition to Shared state





- Subsequent read hits on CPU 0 or CPU 1 don't generate messages
- Only read misses generate messages, not read hits
  - → Reduces bus bandwidth pressure since misses are rare!





1. CPU 1 writes to line, causing a Write Hit





- 1. CPU 1 writes to line, causing a Write Hit
- 2. An Invalidate message is broadcast on the Bus
  - o To remove all copies of the line that may become inconsistent





- 1. CPU 1 writes to line, causing a Write Hit
- 2. An Invalidate message is broadcast on the Bus
  - o To remove all copies of the line that may become inconsistent
- 3. CPU 0 snoops message and invalidates its line





- 1. CPU 1 writes to line, causing a Write Hit
- 2. An Invalidate message is broadcast on the Bus
  - o To remove all copies of the line that may become inconsistent
- 3. CPU 0 snoops message and invalidates its line
- 4. Line in CPU 1 transitions to Modified state





- Subsequent write hits on CPU 1 don't generate messages
- Only write hit in Shared state generate messages
  - → Reduces bandwidth since most write hits are to Modified state





1. CPU 0 writes to line, causing a Write Miss





- 1. CPU 0 writes to line, causing a Write Miss
- 2. A ReadX (read exclusive) message is broadcast on the Bus
  - To read line into CPU 0 before writing to it (remember?)
  - To remove all copies of the line that may become inconsistent





- 1. CPU 0 writes to line, causing a Write Miss
- 2. A ReadX (read exclusive) message is broadcast on the Bus
  - To read line into CPU 0 before writing to it (remember?)
  - o To remove all copies of the line that may become inconsistent
- 3. CPU 1 snoops message and provides line to CPU 0
- Line in CPU 1 is invalidated before line in CPU 0 is modified



- Subsequent write hits on CPU 0 don't generate messages
- Write miss generates message, but not subsequent write hits
  - → Reduces bus bandwidth pressure since misses are rare!



#### Cache Controller FSM for MSI Protocol

Processor activity in red, Bus activity in blue



# TLB Coherence



#### How about TLBs?

- We said TLBs are also a type of cache that caches PTEs.
  - So what happens if a processor changes a PTE?
  - How does that change get propagated to other processor TLBs?
- Unfortunately, there is no hardware coherence for TLBs. 🕾
- That means software (the OS) must handle the coherence
  - Which is of course much much slower



#### TLB shootdown

- In order to update a PTE (page table entry)
  - Initiator OS must first flush its own TLB
  - Send IPIs (Inter-processor interrupts) to other processors
    - To flush the TLBs for all other processors too
  - Source of significant performance overhead

TLB Flushes in Linux and FreeBSD



\* Courtesy of Nadav Amit et al. at VMWare

