

#### Microprocessors



## Instruction Level Parallelism (part 5) Dynamic scheduling-Tomasulo

Reading

**Textbook Chapter 3** 

# Part 1 A tomasulo loop based example

### Consider the following loop

- A loop that multiplies the values of an array by a scalar stored in F2. (1.33 for example)
- R1 is pre-loaded with value 80.

```
LOOP: L.D F0, 0(R1)

MUL.D F4, F0, F2

S.D F4, 0(R1)

SUBI R1, R1, 8

BNEZ R1, LOOP
```

#### Assumptions

- Multiply takes 4 cycles to execute.
- The first load takes 8 cycles because of a cache miss.
- Successive loads take 1 clock cycle.
- Two loads or two stores cannot be executed at the same time.

#### Whole system status initially

LOOP: L.D F0, 0(R1)
MUL.D F4, F0, F2
S.D F4, 0(R1)
SUBI R1, R1, 8

BNEZ R1, LOOP

| Iteration<br># | Instru | ction | j | k | Issue | Execution<br>Complete | Write<br>Result |
|----------------|--------|-------|---|---|-------|-----------------------|-----------------|
|                |        |       |   |   |       |                       |                 |
|                |        |       |   |   |       |                       |                 |
|                |        |       |   |   |       |                       |                 |
|                |        |       |   |   |       |                       |                 |
|                |        |       |   |   |       |                       |                 |
|                |        |       |   |   |       |                       |                 |
|                |        |       |   |   |       |                       |                 |
|                |        |       |   |   |       |                       |                 |

| Reg | Reg. file |  |  |  |  |  |
|-----|-----------|--|--|--|--|--|
|     | Qi        |  |  |  |  |  |
| F0  |           |  |  |  |  |  |
| F2  | 0         |  |  |  |  |  |
| F4  |           |  |  |  |  |  |
| F6  |           |  |  |  |  |  |
| F8  |           |  |  |  |  |  |
| F10 |           |  |  |  |  |  |

|   | CIOCK |
|---|-------|
|   |       |
|   |       |
|   | R1    |
| , | value |
|   | 80    |

#### Load buffers

|    | Busy | Address |            | Busy | Address | V | Q |
|----|------|---------|------------|------|---------|---|---|
| L1 | 0    |         | S1         | 0    |         |   |   |
| L2 | 0    |         | S2         | 0    |         |   |   |
| L3 | 0    |         | <b>S</b> 3 | 0    |         |   |   |

| Time |            | Busy | ор | vj | Vk | Qj | Qk | А |
|------|------------|------|----|----|----|----|----|---|
|      | <b>A</b> 1 | 0    |    |    |    |    |    |   |
|      | A2         | 0    |    |    |    |    |    |   |
|      | А3         | 0    |    |    |    |    |    |   |

|    | Busy | ор | vj | Vk | Qj | Qk | Α |
|----|------|----|----|----|----|----|---|
| M1 | 0    |    |    |    |    |    |   |
| M2 | 0    |    |    |    |    | 4  |   |

LOOP: L.D F0, 0(R1)

MUL.D F4, F0, F2

S.D F4, 0(R1)

SUBI R1, R1, 8

BNEZ R1, LOOP

| Iteration<br># | Instru | ction | j | k  | Issue | Execution Complete | Write<br>Result |
|----------------|--------|-------|---|----|-------|--------------------|-----------------|
| 1              | L.D    | F0    | 0 | R1 | 1     |                    |                 |
|                |        |       |   |    |       |                    |                 |
|                |        |       |   |    |       |                    |                 |
|                |        |       |   |    |       |                    |                 |
|                |        |       |   |    |       |                    |                 |
|                |        |       |   |    |       |                    |                 |
|                |        |       |   |    |       |                    |                 |
|                |        |       |   |    |       |                    |                 |

| Re  | eg. file | Cloc |
|-----|----------|------|
|     | Qi       | 1    |
| FO  | L1       |      |
| F2  | 0        | R1   |
| F4  |          | valu |
| F6  |          | 80   |
| F8  |          |      |
| F10 |          |      |

#### Load buffers

|    | Busy | Address |    | Busy | Address | V | Q |
|----|------|---------|----|------|---------|---|---|
| L1 | 1    | 80      | S1 | 0    |         |   |   |
| L2 | 0    |         | S2 | 0    |         |   |   |
| L3 | 0    |         | S3 | 0    |         |   |   |

| Time |            | Busy | ор | vj | Vk | Qj | Qk | А |
|------|------------|------|----|----|----|----|----|---|
|      | <b>A</b> 1 | 0    |    |    |    |    |    |   |
|      | A2         | 0    |    |    |    |    |    |   |
|      | <b>A3</b>  | 0    |    |    |    |    |    |   |

|    | Busy | ор | vj | Vk | Q | Qk | Α |
|----|------|----|----|----|---|----|---|
| M1 | 0    |    |    |    |   |    |   |
| M2 | 0    |    |    |    |   | 6  | - |

LOOP: L.D F0, 0(R1)

MUL.D F4, F0, F2

S.D F4, 0(R1) SUBI R1, R1, 8 BNEZ R1, LOOP

| Iteration<br># | Instru | ction | j  | k  | Issue | Execution<br>Complete | Write<br>Result |
|----------------|--------|-------|----|----|-------|-----------------------|-----------------|
| 1              | L.D    | F0    | 0  | R1 | 1     | 2                     |                 |
| 1              | MUL    | F4    | F0 | F2 | 2     |                       |                 |
|                |        |       |    |    |       |                       |                 |
|                |        |       |    |    |       |                       |                 |
|                |        |       |    |    |       |                       |                 |
|                |        |       |    |    |       |                       |                 |
|                |        |       |    |    |       |                       |                 |
|                |        |       |    |    |       |                       |                 |

| Re  | eg. file | Clock |
|-----|----------|-------|
|     | Qi       | 2     |
| FO  | L1       |       |
| F2  | 0        | R1    |
| F4  | M1       | value |
| F6  |          | 80    |
| F8  |          |       |
| F10 |          |       |

#### Load buffers

|    | Busy | Address |    | Busy | Address | V | Q |
|----|------|---------|----|------|---------|---|---|
| L1 | 1    | 80      | S1 | 0    |         |   |   |
| L2 | 0    |         | S2 | 0    |         |   |   |
| L3 | 0    |         | S3 | 0    |         |   |   |

| Time |            | Busy | ор | vj | Vk | Qj | Qk | Α |
|------|------------|------|----|----|----|----|----|---|
|      | <b>A</b> 1 | 0    |    |    |    |    |    |   |
|      | A2         | 0    |    |    |    |    |    |   |
|      | А3         | 0    |    |    |    |    |    |   |

|    | Busy | ор  | vj | Vk    | Qj | Qk | Α |
|----|------|-----|----|-------|----|----|---|
| M1 | 1    | MUL |    | R[F2] | L1 |    |   |
| M2 | 0    |     |    |       |    | 7  |   |

LOOP: L.D F0, 0(R1)

MUL.D F4, F0, F2

S.D F4, 0(R1)

SUBI R1, R1, 8

BNEZ R1, LOOP

| Iteration<br># | Instru | ction | j  | k  | Issue | Execution Complete | Write<br>Result |
|----------------|--------|-------|----|----|-------|--------------------|-----------------|
| 1              | L.D    | F0    | 0  | R1 | 1     | 2                  |                 |
| 1              | MUL    | F4    | F0 | F2 | 2     |                    |                 |
| 1              | S.D    | F4    | 0  | R1 | 3     |                    |                 |
|                |        |       |    |    |       |                    |                 |
|                |        |       |    |    |       |                    |                 |
|                |        |       |    |    |       |                    |                 |
|                |        |       |    |    |       |                    |                 |
|                |        |       |    |    |       |                    |                 |

| Re  | eg. file |
|-----|----------|
|     | $Q_{i}$  |
| FO  | L1       |
| F2  | 0        |
| F4  | M1       |
| F6  |          |
| F8  |          |
| F10 |          |

Clock 3

R1

value

80

#### Load buffers

|    | Busy | Address |            | Busy | Address | V | Q  |
|----|------|---------|------------|------|---------|---|----|
| L1 | 1    | 80      | S1         | 1    | 80      |   | M1 |
| L2 | 0    |         | S2         | 0    |         |   |    |
| L3 | 0    |         | <b>S</b> 3 | 0    |         |   |    |

| Time |            | Busy | ор | vj | Vk | Qj | Qk | А |
|------|------------|------|----|----|----|----|----|---|
|      | <b>A</b> 1 | 0    |    |    |    |    |    |   |
|      | A2         | 0    |    |    |    |    |    |   |
|      | А3         | 0    |    |    |    |    |    |   |

|    | Busy | ор  | vj | Vk    | Qj | Qk | Α |
|----|------|-----|----|-------|----|----|---|
| M1 | 1    | MUL |    | R[F2] | L1 |    |   |
| M2 | 0    |     |    |       |    | 8  |   |

LOOP: L.D F0, 0(R1)

MUL.D F4, F0, F2

S.D F4, 0(R1)

SUBI R1, R1, 8

BNEZ R1, LOOP

Since this is integer operation, it doesn't enter the tumosalo's architecture

| Iteration<br># | Instru | ction | j  | k  | Issue | Execution Complete | Write<br>Result |
|----------------|--------|-------|----|----|-------|--------------------|-----------------|
| 1              | L.D    | F0    | 0  | R1 | 1     | 2                  |                 |
| 1              | MUL    | F4    | F0 | F2 | 2     |                    |                 |
| 1              | S.D    | F4    | 0  | R1 | 3     |                    |                 |
|                |        |       |    |    |       |                    |                 |
|                |        |       |    |    |       |                    |                 |
|                |        |       |    |    |       |                    |                 |
|                |        |       |    |    |       |                    |                 |
|                |        |       |    |    |       |                    |                 |

| Re  | eg. file |
|-----|----------|
|     | Qi       |
| FO  | L1       |
| F2  | 0        |
| F4  | M1       |
| F6  |          |
| F8  |          |
| F10 |          |

| ( | Clock |
|---|-------|
|   | 4     |
|   | R1    |

value 80

| Time |            | Busy | ор | vj | Vk | Qj | Qk | Α |
|------|------------|------|----|----|----|----|----|---|
|      | <b>A</b> 1 | 0    |    |    |    |    |    |   |
|      | <b>A2</b>  | 0    |    |    |    |    |    |   |
|      | <b>A3</b>  | 0    |    |    |    |    |    |   |

|    | Busy | ор  | vj | Vk    | Qj | Qk | Α |
|----|------|-----|----|-------|----|----|---|
| M1 | 1    | MUL |    | R[F2] | L1 |    |   |
| M2 | 0    |     |    |       |    | 9  |   |

#### Load buffers

|    | Busy | Address |            | Busy | Address | V | Q  |
|----|------|---------|------------|------|---------|---|----|
| L1 | 1    | 80      | S1         | 1    | 80      |   | M1 |
| L2 | 0    |         | S2         | 0    |         |   |    |
| L3 | 0    |         | <b>S</b> 3 | 0    |         |   |    |

L.D F0, 0(R1) LOOP:

MUL.D F4, F0, F2

S.D F4, 0(R1)

SUBI R1, R1, 8

BNEZ R1, LOOP



| Re  | eg. file |
|-----|----------|
|     | $Q_{i}$  |
| FO  | L1       |
| F2  | 0        |
| F4  | M1       |
| F6  |          |
| F8  |          |
| F10 |          |

Clock

5

R1 value

**72** 

#### Load buffers

|    | Busy | Address |            | Busy | Address | V | Q  |
|----|------|---------|------------|------|---------|---|----|
| L1 | 1    | 80      | S1         | 1    | 80      |   | M1 |
| L2 | 0    |         | S2         | 0    |         |   |    |
| L3 | 0    |         | <b>S</b> 3 | 0    |         |   |    |

| Time |            | Busy | ор | vj | Vk | Qj | Qk | А |
|------|------------|------|----|----|----|----|----|---|
|      | <b>A</b> 1 | 0    |    |    |    |    |    |   |
|      | A2         | 0    |    |    |    |    |    |   |
|      | А3         | 0    |    |    |    |    |    |   |

|    | Busy | ор  | vj | Vk    | Qj | Qk | Α |
|----|------|-----|----|-------|----|----|---|
| M1 | 1    | MUL |    | R[F2] | L1 |    |   |
| M2 | 0    |     |    |       |    | 10 |   |

LOOP: L.D F0, 0(R1)

MUL.D F4, F0, F2

S.D F4, 0(R1) SUBI R1, R1, 8 BNEZ R1, LOOP

#### **Discussion**

What happened here? L2 overrides L1 in the register file. Is this a problem?

| Iteration<br># | Instru | ction | j  | k  | Issue | Execution<br>Complete | Write<br>Result |
|----------------|--------|-------|----|----|-------|-----------------------|-----------------|
| 1              | L.D    | F0    | 0  | R1 | 1     | 2                     |                 |
| 1              | MUL    | F4    | F0 | F2 | 2     |                       |                 |
| 1              | S.D    | F4    | 0  | R1 | 3     |                       |                 |
| 2              | L.D    | F0    | 0  | R1 | 6     |                       |                 |
|                |        |       |    |    |       |                       |                 |
|                |        |       |    |    |       |                       |                 |
|                |        |       |    |    |       |                       |                 |
|                |        |       |    |    |       |                       |                 |

| Re  | Reg. file |  |  |  |  |  |  |
|-----|-----------|--|--|--|--|--|--|
|     | Qi        |  |  |  |  |  |  |
| FO  | L2        |  |  |  |  |  |  |
| F2  | 0         |  |  |  |  |  |  |
| F4  | M1        |  |  |  |  |  |  |
| F6  |           |  |  |  |  |  |  |
| F8  |           |  |  |  |  |  |  |
| F10 |           |  |  |  |  |  |  |

#### Clock

6

R1 value 72

#### Load buffers

|    | Busy | Address |    | Busy | Address | V | Q  |
|----|------|---------|----|------|---------|---|----|
| L1 | 1    | 80      | S1 | 1    | 80      |   | M1 |
| L2 | 1    | 72      | S2 | 0    |         |   |    |
| L3 | 0    |         | S3 | 0    |         |   |    |

| Time |            | Busy | ор | vj | Vk | Qj | Qk | Α |
|------|------------|------|----|----|----|----|----|---|
|      | <b>A</b> 1 | 0    |    |    |    |    |    |   |
|      | A2         | 0    |    |    |    |    |    |   |
|      | A3         | 0    |    |    |    |    |    |   |

|    | Busy | ор  | vj | Vk    | Qj | Qk | Α |
|----|------|-----|----|-------|----|----|---|
| M1 | 1    | MUL |    | R[F2] | L1 |    |   |
| M2 | 0    |     |    |       |    | 11 |   |

LOOP: L.D F0, 0(R1)

MUL.D F4, F0, F2

S.D F4, 0(R1) SUBI R1, R1, 8 BNEZ R1, LOOP

#### **Discussion**

What happened here?
M2 overrides M1 in the register file. Is this a problem?

| Iteration<br># | Instru | ction | j  | k  | Issue | Execution Complete | Write<br>Result |
|----------------|--------|-------|----|----|-------|--------------------|-----------------|
| 1              | L.D    | F0    | 0  | R1 | 1     | 2                  |                 |
| 1              | MUL    | F4    | F0 | F2 | 2     |                    |                 |
| 1              | S.D    | F4    | 0  | R1 | 3     |                    |                 |
| 2              | L.D    | F0    | 0  | R1 | 6     |                    |                 |
| 2              | MUL    | F4    | F0 | F2 | 7     |                    |                 |
|                |        |       |    |    |       |                    |                 |
|                |        |       |    |    |       |                    |                 |
|                |        |       |    |    |       |                    |                 |

| Re  | Reg. file |  |  |  |  |  |  |
|-----|-----------|--|--|--|--|--|--|
|     | $Q_{i}$   |  |  |  |  |  |  |
| F0  | L2        |  |  |  |  |  |  |
| F2  | 0         |  |  |  |  |  |  |
| F4  | M2        |  |  |  |  |  |  |
| F6  |           |  |  |  |  |  |  |
| F8  |           |  |  |  |  |  |  |
| F10 |           |  |  |  |  |  |  |

Clock 7

R1 value

72

#### Load buffers

|    | Busy | Address |            | Busy | Address | V | Q  |
|----|------|---------|------------|------|---------|---|----|
| L1 | 1    | 80      | S1         | 1    | 80      |   | M1 |
| L2 | 1    | 72      | S2         | 0    |         |   |    |
| L3 | 0    |         | <b>S</b> 3 | 0    |         |   |    |

| Time |            | Busy | ор | vj | Vk | Qj | Qk | А |
|------|------------|------|----|----|----|----|----|---|
|      | <b>A</b> 1 | 0    |    |    |    |    |    |   |
|      | A2         | 0    |    |    |    |    |    |   |
|      | А3         | 0    |    |    |    |    |    |   |

|    | Busy | ор  | vj | Vk    | Qj | Qk | Α |
|----|------|-----|----|-------|----|----|---|
| M1 | 1    | MUL |    | R[F2] | L1 |    |   |
| M2 | 1    | MUL |    | R[F2] | L2 | 12 |   |

LOOP: L.D F0, 0(R1)

MUL.D F4, F0, F2

S.D F4, 0(R1)

SUBI R1, R1, 8

BNEZ R1, LOOP

| Iteration<br># | Instru | ction | j  | k  | Issue | Execution<br>Complete | Write<br>Result |
|----------------|--------|-------|----|----|-------|-----------------------|-----------------|
| 1              | L.D    | F0    | 0  | R1 | 1     | 2                     |                 |
| 1              | MUL    | F4    | F0 | F2 | 2     |                       |                 |
| 1              | S.D    | F4    | 0  | R1 | 3     |                       |                 |
| 2              | L.D    | F0    | 0  | R1 | 6     |                       |                 |
| 2              | MUL    | F4    | F0 | F2 | 7     |                       |                 |
| 2              | S.D    | F4    | 0  | R1 | 8     |                       |                 |
|                |        |       |    |    |       |                       |                 |
|                |        |       |    |    |       |                       |                 |

| Re  | Reg. file |  |  |  |  |  |  |
|-----|-----------|--|--|--|--|--|--|
|     | Qi        |  |  |  |  |  |  |
| FO  | L2        |  |  |  |  |  |  |
| F2  | 0         |  |  |  |  |  |  |
| F4  | M2        |  |  |  |  |  |  |
| F6  |           |  |  |  |  |  |  |
| F8  |           |  |  |  |  |  |  |
| F10 |           |  |  |  |  |  |  |

Clock

8

R1

value 72

#### Load buffers

|    | Busy | Address |    | Busy | Address | V | Q  |
|----|------|---------|----|------|---------|---|----|
| L1 | 1    | 80      | S1 | 1    | 80      |   | M1 |
| L2 | 1    | 72      | S2 | 1    | 72      |   | M2 |
| L3 | 0    |         | S3 | 0    |         |   |    |

| Time |            | Busy | ор | vj | Vk | Qj | Qk | Α |
|------|------------|------|----|----|----|----|----|---|
|      | <b>A</b> 1 | 0    |    |    |    |    |    |   |
|      | A2         | 0    |    |    |    |    |    |   |
|      | A3         | 0    |    |    |    |    |    |   |

|    | Busy | ор  | vj | Vk    | Qj | Qk | Α |  |
|----|------|-----|----|-------|----|----|---|--|
| M1 | 1    | MUL |    | R[F2] | L1 |    |   |  |
| M2 | 1    | MUL |    | R[F2] | L2 | 13 |   |  |

LOOP: L.D F0, 0(R1)

MUL.D F4, F0, F2

S.D F4, 0(R1)

SUBI R1, R1, 8

BNEZ R1, LOOP

In this cycle, SUBI is working and the load from iteration 1 finished execution (it needed 8 cycles as stated)

| Iteration<br># | Instru | ction | j  | k  | Issue | Execution<br>Complete | Write<br>Result |
|----------------|--------|-------|----|----|-------|-----------------------|-----------------|
| 1              | L.D    | F0    | 0  | R1 | 1     | 29                    |                 |
| 1              | MUL    | F4    | F0 | F2 | 2     |                       |                 |
| 1              | S.D    | F4    | 0  | R1 | 3     |                       |                 |
| 2              | L.D    | F0    | 0  | R1 | 6     |                       |                 |
| 2              | MUL    | F4    | F0 | F2 | 7     |                       |                 |
| 2              | S.D    | F4    | 0  | R1 | 8     |                       |                 |
|                |        |       |    |    |       |                       |                 |
|                |        |       |    |    |       |                       |                 |

| Re  | Reg. file |  |  |  |  |  |
|-----|-----------|--|--|--|--|--|
|     | Qi        |  |  |  |  |  |
| F0  | L2        |  |  |  |  |  |
| F2  | 0         |  |  |  |  |  |
| F4  | M2        |  |  |  |  |  |
| F6  |           |  |  |  |  |  |
| F8  |           |  |  |  |  |  |
| F10 |           |  |  |  |  |  |

| CIOCK       | • |
|-------------|---|
| 9           |   |
| R1<br>value |   |

**72** 

#### Load buffers

|    | Busy | Address |            | Busy | Address | V | Q  |
|----|------|---------|------------|------|---------|---|----|
| L1 | 1    | 80      | S1         | 1    | 80      |   | M1 |
| L2 | 1    | 72      | S2         | 1    | 72      |   | M2 |
| L3 | 0    |         | <b>S</b> 3 | 0    |         |   |    |

| Time |            | Busy | ор | vj | Vk | Qj | Qk | Α |
|------|------------|------|----|----|----|----|----|---|
|      | <b>A</b> 1 | 0    |    |    |    |    |    |   |
|      | A2         | 0    |    |    |    |    |    |   |
|      | A3         | 0    |    |    |    |    |    |   |

|    | Busy | ор  | vj | Vk    | Qj | Qk | Α |
|----|------|-----|----|-------|----|----|---|
| M1 | 1    | MUL |    | R[F2] | L1 |    |   |
| M2 | 1    | MUL |    | R[F2] | L2 | 14 |   |

LOOP: L.D F0, 0(R1)

MUL.D F4, F0, F2

S.D F4, 0(R1) SUBI R1, R1, 8

BNEZ R1, LOOP



| Iteration<br># | Instruction |    | j  | k  | Issue | Execution<br>Complete | Write<br>Result |
|----------------|-------------|----|----|----|-------|-----------------------|-----------------|
| 1              | L.D         | F0 | 0  | R1 | 1     | 29                    | 10              |
| 1              | MUL         | F4 | F0 | F2 | 2     |                       |                 |
| 1              | S.D         | F4 | 0  | R1 | 3     |                       |                 |
| 2              | L.D         | F0 | 0  | R1 | 6     | 10                    |                 |
| 2              | MUL         | F4 | F0 | F2 | 7     |                       |                 |
| 2              | S.D         | F4 | 0  | R1 | 8     |                       |                 |
|                |             |    |    |    |       |                       |                 |
|                |             |    |    |    |       |                       |                 |

| Re  | Reg. file |  |  |  |  |  |
|-----|-----------|--|--|--|--|--|
|     | Qi        |  |  |  |  |  |
| FO  | L2        |  |  |  |  |  |
| F2  | 0         |  |  |  |  |  |
| F4  | M2        |  |  |  |  |  |
| F6  |           |  |  |  |  |  |
| F8  |           |  |  |  |  |  |
| F10 |           |  |  |  |  |  |

Clock

10

R1

value

64

**Time** 

| remaining A1 |  | Busy       | ор | vj | Vk | Qj | Qk | Α |  |
|--------------|--|------------|----|----|----|----|----|---|--|
|              |  | <b>A</b> 1 | 0  |    |    |    |    |   |  |
|              |  | <b>A2</b>  | 0  |    |    |    |    |   |  |
|              |  | А3         | 0  |    |    |    |    |   |  |

|   |    | Busy | ор  | vj      | Vk    | Qj | Qk | Α |
|---|----|------|-----|---------|-------|----|----|---|
| 4 | M1 | 1    | MUL | MEM[80] | R[F2] |    |    |   |
|   | M2 | 1    | MUL |         | R[F2] | L2 | 15 |   |

#### Load buffers

|    | Busy | Address |    | Busy | Address | V | Q  |
|----|------|---------|----|------|---------|---|----|
| L1 | 0    |         | S1 | 1    | 80      |   | M1 |
| L2 | 1    | 72      | S2 | 1    | 72      |   | M2 |
| L3 | 0    |         | S3 | 0    |         |   |    |

LOOP: L.D F0, 0(R1)

MUL.D F4, F0, F2

S.D F4, 0(R1) SUBI R1, R1, 8

BNEZ R1, LOOP

#### In this cycle:

- L.D of third iteration is issued because we have load empty buffers.
- L.D of iteration 2 finishes (it needs 1 cycle now)

| Iteration<br># | Instruction |    | j  | k  | Issue | Execution<br>Complete | Write<br>Result |
|----------------|-------------|----|----|----|-------|-----------------------|-----------------|
| 1              | L.D         | F0 | 0  | R1 | 1     | 29                    | 10              |
| 1              | MUL         | F4 | F0 | F2 | 2     | 11                    |                 |
| 1              | S.D         | F4 | 0  | R1 | 3     |                       |                 |
| 2              | L.D         | F0 | 0  | R1 | 6     | 10                    | 11              |
| 2              | MUL         | F4 | F0 | F2 | 7     |                       |                 |
| 2              | S.D         | F4 | 0  | R1 | 8     |                       |                 |
| 3              | L.D         | F0 | 0  | R1 | 11    |                       |                 |
|                |             |    |    |    | _     |                       | _               |

| Re  | eg. file | Clock |
|-----|----------|-------|
|     | Qi       | 11    |
| F0  | L3       |       |
| F2  | 0        | R1    |
| F4  | M2       | value |
| F6  |          | 64    |
| F8  |          |       |
| F10 |          |       |

#### Time

| re | remaining |            | Busy | ор | vj | Vk | Qj | Qk | Α |
|----|-----------|------------|------|----|----|----|----|----|---|
|    |           | <b>A</b> 1 | 0    |    |    |    |    |    |   |
|    |           | A2         | 0    |    |    |    |    |    |   |
|    |           | A3         | 0    |    |    |    |    |    |   |

|   |   |    | Busy | ор  | vj      | Vk    | Qj | Qk | Α |
|---|---|----|------|-----|---------|-------|----|----|---|
|   | 3 | M1 | 1    | MUL | MEM[80] | R[F2] |    |    |   |
| 1 | 4 | M2 | 1    | MUL | MEM[72] | R[F2] |    | 16 |   |

#### Load buffers

|    | Busy | Address |    | Busy | Address | V | Q  |
|----|------|---------|----|------|---------|---|----|
| L1 | 0    |         | S1 | 1    | 80      |   | M1 |
| L2 | 0    |         | S2 | 1    | 72      |   | M2 |
| L3 | 1    | 64      | S3 | 0    |         |   |    |

LOOP: L.D F0, 0(R1)

MUL.D F4, F0, F2

S.D F4, 0(R1) SUBI R1, R1, 8 BNEZ R1, LOOP

#### In this cycle:

- MUL.D of iteration 2 starts executing.
- Can we issue MUL.D of the third iteration?

| Iteration<br># | Instru | ction | j  | k  | Issue | Execution<br>Complete | Write<br>Result |
|----------------|--------|-------|----|----|-------|-----------------------|-----------------|
| 1              | L.D    | F0    | 0  | R1 | 1     | 29                    | 10              |
| 1              | MUL    | F4    | F0 | F2 | 2     | 11                    |                 |
| 1              | S.D    | F4    | 0  | R1 | 3     |                       |                 |
| 2              | L.D    | F0    | 0  | R1 | 6     | 10                    | 11              |
| 2              | MUL    | F4    | F0 | F2 | 7     | 12                    |                 |
| 2              | S.D    | F4    | 0  | R1 | 8     |                       |                 |
| 3              | L.D    | F0    | 0  | R1 | 11    | 12                    |                 |
|                |        |       |    |    |       |                       |                 |

| Re  | eg. file |
|-----|----------|
|     | Qi       |
| FO  | L3       |
| F2  | 0        |
| F4  | M2       |
| F6  |          |
| F8  |          |
| F10 |          |

Clock

12

**R1** 

value

64

Time

|   | 1 11110  |            |      |    |    |    |    |    |   |
|---|----------|------------|------|----|----|----|----|----|---|
| r | emaining | 9          | Busy | ор | vj | Vk | Qj | Qk | Α |
|   |          | <b>A</b> 1 | 0    |    |    |    |    |    |   |
|   |          | A2         | 0    |    |    |    |    |    |   |
|   |          | A3         | 0    |    |    |    |    |    |   |

|   |   |    | Busy | ор  | vj      | Vk    | Qj | Qk | Α |
|---|---|----|------|-----|---------|-------|----|----|---|
|   | 2 | M1 | 1    | MUL | MEM[80] | R[F2] |    |    |   |
| 1 | 3 | M2 | 1    | MUL | MEM[72] | R[F2] |    | 17 |   |

#### Load buffers

|    | Busy | Address |            | Busy | Address | V | Q  |
|----|------|---------|------------|------|---------|---|----|
| L1 | 0    |         | S1         | 1    | 80      |   | M1 |
| L2 | 0    |         | S2         | 1    | 72      |   | M2 |
| L3 | 1    | 64      | <b>S</b> 3 | 0    |         |   |    |

LOOP: L.D F0, 0(R1)

MUL.D F4, F0, F2 S.D F4, 0(R1)

SUBI R1, R1, 8 BNEZ R1, LOOP

#### In this cycle:

- Can we issue S.D of iteration 3?
- → NO! once an instruction in the queue is stuck, we don't issue after it. We must adhere to FIFO order

| Iteration<br># | Instru | ction | j  | k  | Issue | Execution Complete | Write<br>Result |
|----------------|--------|-------|----|----|-------|--------------------|-----------------|
| 1              | L.D    | F0    | 0  | R1 | 1     | 29                 | 10              |
| 1              | MUL    | F4    | F0 | F2 | 2     | 11                 |                 |
| 1              | S.D    | F4    | 0  | R1 | 3     |                    |                 |
| 2              | L.D    | F0    | 0  | R1 | 6     | 10                 | 11              |
| 2              | MUL    | F4    | F0 | F2 | 7     | 12                 |                 |
| 2              | S.D    | F4    | 0  | R1 | 8     |                    |                 |
| 3              | L.D    | F0    | 0  | R1 | 11    | 12                 | 13              |
|                |        |       |    |    |       |                    |                 |

| Re  | eg. file       |   | 13     |
|-----|----------------|---|--------|
|     | Q <sub>i</sub> |   |        |
| F0  | 0              | Ŀ | nem[64 |
| F2  | 0              |   |        |
| F4  | M2             |   |        |
| F6  |                |   |        |
| F8  |                |   | R1     |
| F10 |                | \ | value  |
|     |                |   | 64     |

Clock

#### Load buffers

#### Store buffers

|    | Busy | Address |            | Busy | Address | V | Q  |
|----|------|---------|------------|------|---------|---|----|
| L1 | 0    |         | S1         | 1    | 80      |   | M1 |
| L2 | 0    |         | S2         | 1    | 72      |   | M2 |
| L3 | 0    |         | <b>S</b> 3 | 0    |         |   |    |

Time

| r | emainin | g          | Busy | ор | vj | Vk | Qj | Qk | Α |
|---|---------|------------|------|----|----|----|----|----|---|
|   |         | <b>A</b> 1 | 0    |    |    |    |    |    |   |
|   |         | A2         | 0    |    |    |    |    |    |   |
|   |         | А3         | 0    |    |    |    |    |    |   |

|   |   |    | Busy | ор  | vj      | Vk    | Qj | Qk | Α |
|---|---|----|------|-----|---------|-------|----|----|---|
| ı | 1 | M1 | 1    | MUL | MEM[80] | R[F2] |    |    |   |
| 1 | 2 | M2 | 1    | MUL | MEM[72] | R[F2] |    | 18 |   |

L.D F0, 0(R1) LOOP:

MUL.D F4, F0, F2

S.D F4, 0(R1) SUBI R1, R1, 8 BNEZ R1, LOOP

#### In this cycle:

• the first MUL.D finished execution!

| Iteration<br># | Instruction |    | j  | k  | Issue | Execution Complete | Write<br>Result |  |
|----------------|-------------|----|----|----|-------|--------------------|-----------------|--|
| 1              | L.D         | F0 | 0  | R1 | 1     | 29                 | 10              |  |
| 1              | MUL         | F4 | F0 | F2 | 2     | 1114               |                 |  |
| 1              | S.D         | F4 | 0  | R1 | 3     |                    |                 |  |
| 2              | L.D         | F0 | 0  | R1 | 6     | 10                 | 11              |  |
| 2              | MUL         | F4 | F0 | F2 | 7     | 12                 |                 |  |
| 2              | S.D         | F4 | 0  | R1 | 8     |                    |                 |  |
| 3              | L.D         | F0 | 0  | R1 | 11    | 12                 | 13              |  |
|                |             |    |    |    |       |                    |                 |  |

| Re  | eg. file |
|-----|----------|
|     | Qi       |
| F0  | 0        |
| F2  | 0        |
| F4  | M2       |
| F6  |          |
| F8  |          |
| F10 |          |

| U  | R1    |
|----|-------|
| M2 | value |
|    | 64    |
|    |       |

Clock

14

#### Load buffers

|    | Busy | Address |            | Busy | Address | V | Q  |
|----|------|---------|------------|------|---------|---|----|
| L1 | 0    |         | S1         | 1    | 80      |   | M1 |
| L2 | 0    |         | S2         | 1    | 72      |   | M2 |
| L3 | 0    |         | <b>S</b> 3 | 0    |         |   |    |

| Γ | Ī | r | Y | 1 | E | • |
|---|---|---|---|---|---|---|
|   |   |   |   |   |   |   |

| remaining A1 |  | Busy       | ор | vj | Vk | Qj | Qk | Α |  |
|--------------|--|------------|----|----|----|----|----|---|--|
|              |  | <b>A</b> 1 | 0  |    |    |    |    |   |  |
|              |  | A2         | 0  |    |    |    |    |   |  |
|              |  | А3         | 0  |    |    |    |    |   |  |

|   |    | Busy | ор  | vj      | Vk    | Qj | Qk | Α |  |
|---|----|------|-----|---------|-------|----|----|---|--|
| 0 | M1 | 1    | MUL | MEM[80] | R[F2] |    |    |   |  |
| 1 | M2 | 1    | MUL | MEM[72] | R[F2] |    | 19 |   |  |

LOOP: L.D F0, 0(R1)

MUL.D F4, F0, F2

S.D F4, 0(R1) SUBI R1, R1, 8 BNEZ R1, LOOP

#### In this cycle:

- the first MUL.D writes result
- Second MUL.D finishes execution.

| Iteration<br># | Instruction |    | j  | k  | Issue | Execution Complete | Write<br>Result |
|----------------|-------------|----|----|----|-------|--------------------|-----------------|
| 1              | L.D         | F0 | 0  | R1 | 1     | 29                 | 10              |
| 1              | MUL         | F4 | F0 | F2 | 2     | 1114               | 15              |
| 1              | S.D         | F4 | 0  | R1 | 3     |                    |                 |
| 2              | L.D         | F0 | 0  | R1 | 6     | 10                 | 11              |
| 2              | MUL         | F4 | F0 | F2 | 7     | 1215               |                 |
| 2              | S.D         | F4 | 0  | R1 | 8     |                    |                 |
| 3              | L.D         | F0 | 0  | R1 | 11    | 12                 | 13              |
|                |             |    |    |    |       |                    |                 |

| Re  | Reg. file |  |  |  |  |  |  |
|-----|-----------|--|--|--|--|--|--|
|     | Qi        |  |  |  |  |  |  |
| FO  | 0         |  |  |  |  |  |  |
| F2  | 0         |  |  |  |  |  |  |
| F4  | M2        |  |  |  |  |  |  |
| F6  |           |  |  |  |  |  |  |
| F8  |           |  |  |  |  |  |  |
| F10 |           |  |  |  |  |  |  |

Clock

15

R1

value

64

|   | _  |    |   |
|---|----|----|---|
| Т | ir | n  | 6 |
| • | •• | •• | • |

| r | remaining A1 |            | Busy | ор | vj | Vk | Qj | Qk | Α |
|---|--------------|------------|------|----|----|----|----|----|---|
|   |              | <b>A</b> 1 | 0    |    |    |    |    |    |   |
|   |              | A2         | 0    |    |    |    |    |    |   |
|   |              | А3         | 0    |    |    |    |    |    |   |

|   |    | Busy | op  | vj      | Vk    | Qj | Qk | Α |
|---|----|------|-----|---------|-------|----|----|---|
| 0 | M1 | 0    |     |         |       |    |    |   |
| 0 | M2 | 1    | MUL | MEM[72] | R[F2] |    | 20 |   |

#### Load buffers

|    | Busy | Address |   |
|----|------|---------|---|
| L1 | 0    |         | S |
| L2 | 0    |         | S |
| L3 |      |         | 2 |
|    |      |         | S |

| S      | 0    |      |              |    |
|--------|------|------|--------------|----|
| S<br>2 | 1    | 72   |              | M2 |
| S<br>1 | 1    | 80   | MEM[80]*1.33 |    |
|        | Busy | Add. | V            | Q  |

LOOP: L.D F0, 0(R1)

MUL.D F4, F0, F2

S.D F4, 0(R1) SUBI R1, R1, 8 BNEZ R1, LOOP

#### In this cycle:

- multiply of iteration 2 writes result
- multiply of iteration 3 can now be **issued**!
- S1

| Iteration<br># | Instruction |    | j  | k  | Issue | Execution<br>Complete | Write<br>Result |
|----------------|-------------|----|----|----|-------|-----------------------|-----------------|
| 1              | L.D         | F0 | 0  | R1 | 1     | 29                    | 10              |
| 1              | MUL         | F4 | F0 | F2 | 2     | 1114                  | 15              |
| 1              | S.D         | F4 | 0  | R1 | 3     | 16                    |                 |
| 2              | L.D         | F0 | 0  | R1 | 6     | 10                    | 11              |
| 2              | MUL         | F4 | F0 | F2 | 7     | 1215                  | 16              |
| 2              | S.D         | F4 | 0  | R1 | 8     |                       |                 |
| 3              | L.D         | F0 | 0  | R1 | 11    | 12                    | 13              |
| 3              | MUL         | F4 | F0 | F2 | 16    |                       |                 |

Timo

| Re  | Reg. file |  |  |  |  |  |
|-----|-----------|--|--|--|--|--|
|     | Qi        |  |  |  |  |  |
| FO  | 0         |  |  |  |  |  |
| F2  | 0         |  |  |  |  |  |
| F4  | M1        |  |  |  |  |  |
| F6  |           |  |  |  |  |  |
| F8  |           |  |  |  |  |  |
| F10 |           |  |  |  |  |  |

Clock

16

R1

value

64

| can start executing | r |
|---------------------|---|
|---------------------|---|

#### Load buffers

|    | Busy | Address |  |
|----|------|---------|--|
| L1 | 0    |         |  |
| L2 | 0    |         |  |
| L3 | 0    |         |  |

|          | Busy | Add. | V              | Q |
|----------|------|------|----------------|---|
| S        | 1    | 80   | MEM[80]*1.33   |   |
| 1        |      |      | N45N45703*4 00 |   |
| S  <br>2 | 1    | 72   | MEM[72]*1.33   |   |
| S        | 0    |      |                |   |

| HIIIIE    |  |            |    |    |    |    |    |   |  |
|-----------|--|------------|----|----|----|----|----|---|--|
| remaining |  | Busy       | ор | vj | Vk | Qj | Qk | Α |  |
|           |  | <b>A</b> 1 | 0  |    |    |    |    |   |  |
|           |  | A2         | 0  |    |    |    |    |   |  |
|           |  | А3         | 0  |    |    |    |    |   |  |

|   |    | Busy | ор  | vj      | Vk   | Qj | Qk | Α |
|---|----|------|-----|---------|------|----|----|---|
| 4 | M1 | 1    | MUL | mem[64] | 1.33 |    |    |   |
| 0 | M2 | 0    |     |         |      |    | 21 |   |

L.D F0, 0(R1) LOOP:

> MUL.D F4, F0, F2 S.D F4, 0(R1) SUBI R1, R1, 8

BNEZ R1, LOOP

#### In this cycle:

- Now this S.D of iteration 3 can be issued (it was stuck before because MUL was stuck)
- Store of iter. 1 writes back, store of iter.2 executes

| Iteration<br># | Instruction |    | j  | k  | Issue | Execution<br>Complete | Write<br>Result |
|----------------|-------------|----|----|----|-------|-----------------------|-----------------|
| 3              | S.D         | F4 | 0  | R1 | 17    |                       |                 |
| 1              | MUL         | F4 | F0 | F2 | 2     | 1114                  | 15              |
| 1              | S.D         | F4 | 0  | R1 | 3     | 16                    | 17              |
| 2              | L.D         | F0 | 0  | R1 | 6     | 10                    | 11              |
| 2              | MUL         | F4 | F0 | F2 | 7     | 1215                  | 16              |
| 2              | S.D         | F4 | 0  | R1 | 8     | 17                    |                 |
| 3              | L.D         | F0 | 0  | R1 | 11    | 12                    | 13              |
| 3              | MUL         | F4 | F0 | F2 | 16    | 17                    |                 |

| Re  | eg. file | Clock |
|-----|----------|-------|
|     | Qi       | 17    |
| FO  | 0        | 17    |
| F2  | 0        | R1    |
| F4  | M1       | value |
| F6  |          | 64    |
| F8  |          |       |
| F10 |          |       |

| _ |   |   |   |   |   |
|---|---|---|---|---|---|
| • | ı | r | r | ١ | Д |
|   |   |   |   |   | J |
|   |   |   |   |   |   |

| remaining |  | Busy       | ор | vj | Vk | Qj | Qk | Α |  |
|-----------|--|------------|----|----|----|----|----|---|--|
|           |  | <b>A</b> 1 | 0  |    |    |    |    |   |  |
|           |  | A2         | 0  |    |    |    |    |   |  |
|           |  | А3         | 0  |    |    |    |    |   |  |

|   |    | Busy | ор  | vj      | Vk   | Qj | Qk | Α |
|---|----|------|-----|---------|------|----|----|---|
| 3 | M1 | 1    | MUL | mem[64] | 1.33 |    |    |   |
| 0 | M2 | 0    |     |         |      |    | 22 |   |

#### Load buffers

| Busy | Address |   |
|------|---------|---|
| 0    |         | Ī |
| 0    |         |   |
| 0    |         |   |
|      | 0       | 0 |

|        | Busy | Add. | V            | Q  |
|--------|------|------|--------------|----|
| S<br>1 |      |      |              |    |
| S<br>2 | 1    | 72   | MEM[72]*1.33 |    |
| S      | 1    | 64   |              | M1 |

LOOP: L.D F0, 0(R1)

MUL.D F4, F0, F2

S.D F4, 0(R1)

SUBI R1, R1, 8

BNEZ R1, LOOP

#### In this cycle:

• Store of iteration 2 writes back

| Iteration<br># | Instruction |    | j  | j k |    | Execution Complete | Write<br>Result |  |
|----------------|-------------|----|----|-----|----|--------------------|-----------------|--|
| 3              | S.D F4      |    | 0  | R1  | 17 |                    |                 |  |
| 1              | MUL         | F4 | F0 | F2  | 2  | 1114               | 15              |  |
| 1              | S.D         | F4 | 0  | R1  | 3  | 16                 | 17              |  |
| 2              | L.D         | F0 | 0  | R1  | 6  | 10                 | 11              |  |
| 2              | MUL         | F4 | F0 | F2  | 7  | 1215               | 16              |  |
| 2              | S.D         | F4 | 0  | R1  | 8  | 17                 | 18              |  |
| 3              | L.D         | F0 | 0  | R1  | 11 | 12                 | 13              |  |
| 3              | MUL         | F4 | F0 | F2  | 16 | 17                 |                 |  |

Time

| Reg. file |    |  |  |  |  |  |
|-----------|----|--|--|--|--|--|
|           | Qi |  |  |  |  |  |
| FO        | 0  |  |  |  |  |  |
| F2        | 0  |  |  |  |  |  |
| F4        | M1 |  |  |  |  |  |
| F6        |    |  |  |  |  |  |
| F8        |    |  |  |  |  |  |
| F10       |    |  |  |  |  |  |



#### Load buffers

|    | Busy | Address |        | Busy | Add. | V | Q  |
|----|------|---------|--------|------|------|---|----|
| L1 | 0    |         | S      |      |      |   |    |
| L2 | 0    |         | 1<br>S |      |      |   |    |
| L3 | 0    |         | 2      |      |      | L |    |
|    |      |         | s      | 1    | 64   |   | M1 |

| remaining |  | Busy       | ор | vj | Vk | Qj | Qk | Α |  |
|-----------|--|------------|----|----|----|----|----|---|--|
|           |  | <b>A</b> 1 | 0  |    |    |    |    |   |  |
|           |  | A2         | 0  |    |    |    |    |   |  |
|           |  | <b>A3</b>  | 0  |    |    |    |    |   |  |

|   |    | Busy | ор  | vj      | Vk   | Qj | Qk | Α |  |
|---|----|------|-----|---------|------|----|----|---|--|
| 2 | M1 | 1    | MUL | mem[64] | 1.33 |    |    |   |  |
| 0 | M2 | 0    |     |         |      |    | 23 |   |  |

### And so on..

- As you noticed, issue is always in order, but execution is out of order.
- Instructions are re-ordered dynamically, executed out of order, and all types of hazards are avoided and handled.

# Part 2 Loads and stores

### Loads and stores special cases

 Loads and stores can execute out of order, only if they access different memory addresses.

 What happens when loads and stores access the same memory?

```
LD F0, 10 (R1)
SD F1, 12 (R2)
LD F2, 14 (R3)
SD F3, 16 (R4)
```

- Assume R1=10, R2=8, R3=6, R4=4
- → All of them map to the **same** address.

```
LD F0, 10 (R1)
SD F1, 12 (R2)
LD F2, 14 (R3)
SD F3, 16 (R4)
```

#### Case 1

Load is before the store and we interchange them:

**Result:** F0 would be loaded a new value rather than old → WAR hazard

```
LD F0, 10 (R1)
SD F1, 12 (R2)
LD F2, 14 (R3)
SD F3, 16 (R4)
```

#### Case 2

Store is before the load and we interchange them:

**Result:** F2 would be loaded an old value rather than new → RAW hazard

```
LD F0, 10 (R1)
SD F1, 12 (R2)
LD F2, 14 (R3)
SD F3, 16 (R4)
```

#### Case 3

Store is before the store and we interchange them:

Result: the memory address would get F1 at the end of the program rather than F3 → WAW hazard

#### Solution for the Loads:

Check if there are any uncompleted stores that precedes the load in program order with the same effective address.

#### Solution for the stores:

Wait until uncompleted loads <u>and</u> stores that precede in program order and share the same effective address.



31

### Implementations: load

 Effective address calculations must be done in program order using the address unit.

 When a load finishes the effective address calculation, it's not issued right away but the A field of all the active stores in the store buffer are checked.

 If there's a conflict (same address), the load is not issued to the load buffer.



### Implementations: store

 When a store finishes the effective address calculation, it's not issued right away but the A field of all the active stores and loads in the store and load buffer are checked.

• If there's a conflict (same address), the store is not issued to the store buffer.



### **Example**

 Assume load was issued but not executed yet so it's in the load buffer and it's A field is = 20

 And we want to issue the store.



### Tomasulo in practice

 Tomasulo's scheme was unused for many years after the 360/91, but was widely adopted in multiple-issue processors starting in the 1990s for several reasons:

- 1. Although Tomasulo's algorithm was designed before caches, the presence of caches, with the inherently unpredictable delays, has become one of the major motivations for dynamic scheduling.
  - Out-of-order execution allows the processors to continue executing instructions while awaiting the completion of a cache miss, thus hiding all or part of the cache miss penalty.

### Tomasulo in practice

- 2. As processors became more aggressive in their issue capability and designers are concerned with the performance of difficult-to-schedule code, techniques such as register renaming, dynamic scheduling, and speculation became more important.
- 3. It can achieve high performance without requiring the compiler to target code to a specific pipeline structure, a valuable property in the era of shrink wrapped mass market software.

\*Many modern processors implement dynamic scheduling schemes that are derivative of Tomasulo's original algorithm, including popular Intel x86-64 chips.

### **End of lecture 13**