# Homework3

## 21307099 李英骏

## 2023.11.1

## 1 Problem 1

## 1.1 Correlating Branch Predictor

| Branch PC | Entry | Prediction | Outcome  | Wrong |  |
|-----------|-------|------------|----------|-------|--|
| 2         | 4     | T          | Т        | NO    |  |
| 3         | 6     | NT         | NT       | NO    |  |
| 1         | 2     | NT         | NT       | NO    |  |
| 3         | 7     | NT         | NT       | MO    |  |
| 1         | 3     | Т          | NT       | YES   |  |
| 2         | 4     | Т          | ${ m T}$ | NO    |  |
| 1         | 3     | Т          | NT       | YES   |  |
| 2         | 4     | Т          | ${ m T}$ | NO    |  |
| 3         | 7     | NT         | ${ m T}$ | YES   |  |

表 1: P1-1

error rate =  $\frac{3}{9} \times 100\% \approx 33.3\%$ 

## 1.2 Local Predictor

| Branch PC | Entry | Prediction | Outcome | Wrong |  |
|-----------|-------|------------|---------|-------|--|
| 0         | 0     | Т          | Т       | NO    |  |
| 1         | 4     | Т          | NT      | YES   |  |
| 1         | 1     | NT         | NT      | MO    |  |
| 1         | 3     | T          | NT      | YES   |  |
| 1         | 3     | T          | NT      | YES   |  |
| 0         | 0     | Т          | Т       | NO    |  |
| 1         | 3     | NT         | NT      | NO    |  |
| 0         | 0     | Т          | Т       | NO    |  |
| 1         | 5     | Т          | Т       | NO    |  |

表 2: P1-2

error rate =  $\frac{3}{9} \times 100\% \approx 33.3\%$ 

### 2 Problem 2

| Op   | dest | j  | k  | Issue | Read Oper | Exec Comp | Write Result |
|------|------|----|----|-------|-----------|-----------|--------------|
| LD   | F6   | 34 | R2 | 1     | 2         | 3         | 4            |
| LD   | F2   | 45 | R3 | 5     | 6         | 7         | 8            |
| MULD | F0   | F5 | F2 | 6     | 9         | 19        | 20           |
| MULD | F7   | F2 | F6 | 6     | 9         | 19        | 20           |
| ADDD | F6   | F8 | F7 | 6     | 21        | 23        | 24           |

表 3: P2(a)

| Time | Name    | Busy | Op   | $F_i$ | $F_{j}$ | $F_k$ | $Q_{j}$ | $Q_k$   | $R_{j}$ | $R_k$ |
|------|---------|------|------|-------|---------|-------|---------|---------|---------|-------|
| 9    | Integer | No   |      |       |         |       |         |         |         |       |
| 9    | Mult1   | Yes  | MULD | F0    | F5      | F2    |         | Integer | Yes     | Yes   |
| 9    | Mult2   | Yes  | MULD | F7    | F2      | F6    | Integer | Integer | Yes     | Yes   |
| 9    | Add     | Yes  | ADDD | F6    | F8      | F7    |         | Mult2   |         | No    |

表 4: P2(b)

#### 3 Problem 3

#### 1. Cost of BTB miss:

The cost for BTB miss events, which occur for the 15% of branch instructions not hit in the BTB:

BTB miss cost = Branch frequency × Miss rate × Miss penalty

 $=20\% \times 15\% \times 3$  cycles =0.09 cycles/instruction

#### 2. Cost for hits with wrong prediction:

The cost for cases where BTB hits but the prediction is wrong, which happens for 10% of the 85% of branch instructions where the BTB hits:

 $Hit \ but \ wrong \ prediction \ cost = Branch \ frequency \times Hit \ rate \times Wrong \ prediction \ rate \times Misprediction \ penalty$ 

$$=20\%\times85\%\times10\%\times4$$
 cycles = 0.068 cycles/instruction

#### 3. Total CPI including branch prediction:

The total CPI including branch prediction consists of the base CPI plus the additional CPI due to branch prediction overhead:

Total CPI = Base CPI + BTB miss cost + Hit but wrong prediction cost

$$= 1 + 0.09 + 0.068 = 1.158$$
 cycles/instruction

#### 4. CPI for processor without BTB:

For a processor without a BTB, there is a fixed two-cycle penalty for each branch instruction:

CPI without BTB = Base CPI + Branch frequency 
$$\times$$
 Fixed branch penalty 
$$= 1 + 20\% \times 2 = 1.4 \text{ cycles/instruction}$$

#### 5. Speedup:

The speedup is the ratio of the CPI without BTB to the CPI with BTB:

Speedup = 
$$\frac{\text{CPI without BTB}}{\text{CPI with BTB}}$$
  
  $\approx \frac{1.4}{1.158} \approx 1.209$ 

### 4 Problem 4

(a)

When a hit occurs, it directly leads to executing the next instruction without any delay, effectively decreasing the cycle count by one per branch instruction. Therefore, the penalty adjustment for a buffer hit with an unconditional branch is:

Penalty adjustment for buffer hit = -1 clock cycle per unconditional branch instruction

(b)

Assuming an 80% hit rate for unconditional branches and a frequency of 10% occurrence in instruction set, the calculation for performance improvement from branch folding is as follows:

- For buffer hits (80% of the time), the enhancement saves one clock cycle per hit.
- For buffer misses (20% of the time), the standard penalty is two clock cycles per miss.

The expected improvement from branch folding can therefore be calculated by:

Expected improvement = Frequency of unconditional branches $\times$ 

(Hit rate  $\times$  Improvement per hit + Miss rate  $\times$  Penalty per miss)

$$= 10\% \times (80\% \times (-1 \text{ cycle}) + 20\% \times 2 \text{ cycles})$$

 $= 10\% \times (-0.8 \text{ cycles} + 0.4 \text{ cycles})$ 

= -0.04 cycles per instruction

The negative sign indicates a reduction in the total cycle count per instruction, which is a performance gain. For conditional branches with an assumed two-cycle buffer miss penalty and the same calculation method, the cost would be:

Penalty for conditional branches storing target address =  $10\% \times (80\% \times 0 + 20\% \times 2 \text{ cycles})$ = 0.04 cycles per instruction

The overall performance gain due to the BTB enhancement is:

Overall performance gain = (0.04-(-0.04)) cycles per instruction = 0.08 cycles per instruction improvement

# 5 Problem 5

| Iteration | Instructions    | Issue | Executes | Memory access | Write CDB | Comment                      |
|-----------|-----------------|-------|----------|---------------|-----------|------------------------------|
| 1         | LD F2,0(R1)     | 1     |          | 2             | 3         | 1st instruction              |
| 1         |                 |       |          |               |           | Awaiting F2; 3-              |
|           | MUL.D F4,F2,F0  | 2     | 4        |               | 19        | 4: Reservation Station; 5-   |
|           |                 |       |          |               |           | 18: Multiply execution       |
|           | L.D F6,0(R2)    | 3     |          | 4             | 5         | 4: Load buffer               |
|           |                 |       |          |               |           | Awaiting F4; 5-              |
|           | ADD.D F6,F4,F6  | 4     | 20       |               | 30        | 20: Reservation Station; 21- |
|           |                 |       |          |               |           | 29: Add execution            |
|           | S D E6 0/D2)    | 5     |          | 31            |           | Awaiting F6; 6-              |
|           | S.D F6,0(R2)    | 9     |          | 91            |           | 31: Store buffer             |
|           | DADDIU R1,R1,#8 | 6     | 7        |               | 8         |                              |
|           | DADDIU R2,R2,#8 | 7     | 8        |               | 9         |                              |
|           | DSLTU R3,R1,R4  | 8     | 9        |               | 10        | Awaiting R3                  |
|           | BNEZ R3,foo     | 9     | 11       |               |           | Awaiting BNEZ;               |
|           | L.D F2,0(R1)    | 10    |          | 12            | 13        | Awaiting BNEZ;               |
| 2         | L.D 1 2,0(101)  |       |          |               |           | 11-12: Load buffer           |
|           | MUL.D F4,F2,F0  |       | 19       |               |           | Awaiting F2; 13-             |
|           |                 | 11    |          |               | 34        | 19: Reservation Station; 20- |
|           |                 |       |          |               |           | 33: Multiply execution       |
|           | L.D F6,0(R2)    | 12    |          | 13            | 14        | 13: Load buffer              |
|           |                 |       |          |               |           | Awaiting F4; 14-             |
|           | ADD.D F6,F4,F6  | 13    | 35       |               | 45        | 35: Reservation Station; 36- |
|           |                 |       |          |               |           | 44: Add execution            |
|           | S.D F6,0(R2)    | 14    |          | 46            |           | Awaiting F6; 15-             |
|           | 5.D T 0,0(102)  |       |          | 40            |           | 46: Store buffer             |
|           | DADDIU R1,R1,#8 | 15    | 16       |               | 17        |                              |
|           | DADDIU R2,R2,#8 | 16    | 17       |               | 18        |                              |
|           | DSLTU R3,R1,R4  | 17    | 18       |               | 20        | Awaiting R3                  |
|           | BNEZ R3, foo    | 18    | 20       |               |           | AWaiting R3                  |

表 5: P5