# Multiprocessors and Caching

CS 1541 Wonsun Ahn



#### Two ways to use multiple processors

- Distributed (Memory) System
  - Processors do not share memory (and by extension data)
  - Processors exchange data through network messages
  - Programming standards:
    - Message Passing Interface (MPI) C/C++ API for exchanging messages
    - Ajax (Asynchronous JavaScript and XML) API for web apps
  - Data exchange protocols: TCP/IP, UDP/IP, JSON, XML...
- Shared Memory System (a.k.a. Multiprocessor System)
  - Processors share memory (and by extension data)
  - Programming standards:
    - Pthreads (POSIX threads), Java threads APIs for threading
    - OpenMP Compiler #pragma directives for parallelization
  - o Cache coherence protocol: protocol for exchanging data among caches
  - → Just like Ethernet, caches are part of a larger network of caches



#### Shared Data Review

- What bad thing can happen when you have shared data?
- Dataraces!
  - You should have learned it in CS 449.
  - But if you didn't, don't worry I'll go over it.



```
int shared = 0;
void *add(void *unused) {
  for (int i=0; i < 1000000; i++) { shared++; }
  return NULL;
int main() {
  pthread t t;
  // Child thread starts running add
  pthread create (&t, NULL, add, NULL);
  // Main thread starts running add
  add (NULL);
  // Wait until child thread completes
  pthread join(t, NULL);
  printf("shared=%d\n", shared);
  return 0;
```

```
bash-4.2$ ./datarace
shared=1085894
bash-4.2$ ./datarace
shared=1101173
bash-4.2$ ./datarace
shared=1065494
```

- Q) What do you expect from running this? Maybe shared=2000000?
- A) Nondeterministic result! Due to datarace on shared.



• When two threads do shared++; initially shared = 1





• When two threads do shared++; initially shared = 1





When two threads do shared++; initially shared = 1





When two threads do shared++; initially shared = 1





• When two threads do shared++; initially shared = 1





• When two threads do shared++; initially shared = 1





When two threads do shared++; initially shared = 1





- Why did this occur in the first place?
- Because data was replicated to CPU registers and each worked on its own copy!
  - When two threads do shared++; initially shared = 1





```
pthread mutex t lock;
int shared = 0;
void *add(void *unused) {
  for (int i=0; i < 1000000; i++) {
      pthread mutex lock(&lock);
      shared++;
      pthread mutex unlock(&lock);
  return NULL;
int main() {
```

```
bash-4.2$ ./datarace
shared=2000000
bash-4.2$ ./datarace
shared=2000000
bash-4.2$ ./datarace
shared=2000000
```

- Data race is fixed! Now shared is always 2000000.
- Problem solved? No! CPU registers is not the only place **replication** happens!



Pittsburgh

• What happens if caches sit in between processors and memory?



• Let's say CPU 0 first fetches shared for incrementing



• Then CPU 0 increments shared 100 times to 100





Pittsburgh

• Then CPU 2 gets hold of the mutex and fetches shared from L3



• Then CPU 2 increments shared 10 times to 10





Pittsburgh

• Clearly this is wrong. L1 caches of CPU 0 and CPU 2 are incoherent.



#### Cache Incoherence: Problem with Private Caches

- This problem does not occur with a shared cache.
  - All processors share and work on a single copy of data.

shared 0 ink 0 Bank 1 Bank 2 Bank 3 Shared L3\$

- The problem exists for **private** caches.
  - Private copy is at times inconsistent with lower memory.
  - o **Incoherence** occurs when private copies differ from each other.
    - → Means processors return different values for same location!



# Cache Coherence



#### Cache Coherence

- Cache coherence (loosely defined):
  - All processors of system should see the same view of memory
  - o Copies of values cached by processors should adhere to this rule
- Each ISA has a different definition of what that "view" means
  - Memory consistency model: definition of what that "view" is
- All models agree on one thing:
  - That a change in value should reflect on all copies (eventually)



#### Implementing Cache Coherence

- How to guarantee changes in value are propagated to all caches?
- Cache coherence protocol: A protocol, or set of rules, that all caches must follow to ensure coherence between caches
  - MSI (Modified-Shared-Invalid)
  - MESI (Modified-Exclusive-Shared-Invalid)
  - ... often named after the states in cache controller FSM
- Three states of **MSI** protocol (maintained for each block):
  - Modified: Dirty. Only this cache has copy.
  - Shared: Clean. Other caches may have copy.
  - Invalid: Block contains no data.



# MSI Snoopy Cache Coherence Protocol



- Each processor monitors (snoops) the activity on the bus
  - o In much the same way as how nodes snoop the Ethernet
- Cache state changes in response to both:
  - Read / writes from the local processor
  - Read misses / write misses from remote processors it snoops





1. CPU 0: line in Modified state, CPU 1~3: line in Invalid state





- 1. CPU 0: line in Modified state, CPU 1~3: line in Invalid state
- 2. CPU 1 reads line, causing a Read Miss





- 1. CPU 0: line in Modified state, CPU 1~3: line in Invalid state
- 2. CPU 1 reads line, causing a Read Miss
- 3. A Read message is broadcast on the Bus





- 1. CPU 0: line in Modified state, CPU 1~3: line in Invalid state
- 2. CPU 1 reads line, causing a Read Miss
- 3. A Read message is broadcast on the Bus
- 4. CPU 0 snoops message and provides line to CPU1
  - Line in CPU 0 and CPU 1 both transition to Shared state





- Subsequent read hits on CPU 0 or CPU 1 don't generate messages
- Only read misses generate messages, not read hits
  - → Reduces bus bandwidth pressure since misses are rare!





1. CPU 1 writes to line, causing a Write Hit





- 1. CPU 1 writes to line, causing a Write Hit
- 2. An Invalidate message is broadcast on the Bus
  - To remove all copies of the line that may become inconsistent





- 1. CPU 1 writes to line, causing a Write Hit
- 2. An Invalidate message is broadcast on the Bus
  - To remove all copies of the line that may become inconsistent
- 3. CPU 0 snoops message and invalidates its line





- 1. CPU 1 writes to line, causing a Write Hit
- 2. An Invalidate message is broadcast on the Bus
  - To remove all copies of the line that may become inconsistent
- 3. CPU 0 snoops message and invalidates its line
- 4. Line in CPU 1 transitions to Modified state





- Subsequent write hits on CPU 1 don't generate messages
- Only write hit in Shared state generate messages
  - → Reduces bandwidth since most write hits are to Modified state





1. CPU 0 writes to line, causing a Write Miss





- 1. CPU 0 writes to line, causing a Write Miss
- 2. A ReadX (read exclusive) message is broadcast on the Bus
  - To read line into CPU 0 before writing to it (remember?)
  - o To remove all copies of the line that may become inconsistent





- 1. CPU 0 writes to line, causing a Write Miss
- 2. A ReadX (read exclusive) message is broadcast on the Bus
  - To read line into CPU 0 before writing to it (remember?)
  - To remove all copies of the line that may become inconsistent
- 3. CPU 1 snoops message and provides line to CPU 0
  - o Line in CPU 1 is invalidated before line in CPU 0 is modified



- Subsequent write hits on CPU 0 don't generate messages
- Write miss generates message, but not subsequent write hits
  - → Reduces bus bandwidth pressure since misses are rare!



#### MSI: Example

- Note that caches are kept coherent at all times (same view of memory)
- Caches maintain coherence by monitoring bus activity

| Event                      | P1's cache                               | P2's cache                               |
|----------------------------|------------------------------------------|------------------------------------------|
|                            | L = invalid                              | L = invalid                              |
| P1 writes 10 to A          |                                          | Read Exclusive A (from write in P1)      |
| (write miss)               | $L \leftarrow A = 10 \text{ (modified)}$ | L = invalid                              |
| P1 reads A                 |                                          |                                          |
| (read hit)                 | $L \leftarrow A = 10 \text{ (modified)}$ | L = invalid                              |
| P2 reads A                 | Read A (from read in P2)                 |                                          |
| (read miss)                | L ← A = 10 (shared)                      | L ← A = 10 (shared)                      |
| P2 writes 20 to A          | Invalidate A (from write in P2           |                                          |
| (write hit)                | L = invalid                              | $L \leftarrow A = 20 \text{ (modified)}$ |
| P2 writes 40 to A          |                                          |                                          |
| (write hit)                | L = invalid                              | $L \leftarrow A = 30 \text{ (modified)}$ |
| P1 write 50 to A           |                                          | Read Exclusive A (from write in P1)      |
| University of (write miss) | $L \leftarrow A = 40 \text{ (modified)}$ | L = invalid                              |

#### Cache Controller FSM for MSI Protocol

Processor activity in red, Bus activity in blue



# TLB Coherence



#### How about TLBs?

- We said TLBs are also a type of cache that caches PTEs.
- TLBs need to be kept coherent when page mapping changes
  - Specifically when page is unmapped or page protection changes
  - o Or, program will have illegal access to that physical page
  - Not only TLB of local CPU but TLBs of all CPUs in the system
- Unfortunately, there is no hardware coherence for TLBs. ②
  - Unlike caches, HW doesn't know when TLB entries need updating
- That means software (the OS) must handle the coherence
  - Which is of course much much slower



#### TLB shootdown

- In order to update a PTE (page table entry)
  - Initiator OS must first flush its own TLB
  - Send IPIs (Inter-processor interrupts) to other processors
    - To flush the TLBs for all other processors too
  - Source of significant performance overhead

TLB Flushes in Linux and FreeBSD



\* Courtesy of Nadav Amit et al. at VMWare

