# Modern Processor Design

Hung-Wei Tseng

## Recap: Why adding a sort makes it faster

Why the sorting the array speed up the code despite the increased instruction count?

#### **Outline**

- Pipelined Processor
- Pipeline Hazards
  - Structural Hazards
  - Control Hazards
  - Data Hazards

# Basic Processor Design

#### von Neuman Architecture







#### **Program**

00c2e800 00000008 00c2f000 00000008 00c2f800 00000008 00c30000 00000008

Storage

#### Recap: Microprocessor — a collection of functional units



#### The "life" of an instruction

- Instruction Fetch (IF) fetch the instruction from memory
- Instruction Decode (ID)
  - Decode the instruction for the desired operation and operands
  - Reading source register values
- Execution (EX)
  - ALU instructions: Perform ALU operations
  - Conditional Branch: Determine the branch outcome (taken/not taken)
  - Memory instructions: Determine the effective address for data memory access
- Data Memory Access (MEM) Read/write memory
- Write Back (WB) Present ALU result/read value in the target register
- Update PC
  - If the branch is taken set to the branch target address
  - Otherwise advance to the next instruction current PC + 4

## Functional Units of a Microprocessor



## Simple implementation w/o branch



- Different parts of the processor works on different instructions simultaneously
- A processor is now working on multiple instructions from the same program (though on different stages) simultaneously.
  - ILP: Instruction-level parallelism
- A clock signal controls and synchronize the beginning and the end of each part of the work
- A pipeline register between different parts of the processor to keep intermediate results necessary for the upcoming work





# "Pipeline" the processor!





## How well can we pipeline?

• With a pipelined design, the processor is supposed to deliver the outcome of an instruction each cycle. For the following code snippet, how many pairs of instructions are preventing the pipeline from generating results in back-to-back cycles?

```
xorl
               %eax, %eax
  L3: movl (%rdi), %ecx
               %ecx, %eax
       addl
3
       addq
              $4, %rdi
4
               %rdx, %rdi
(5)
       cmpq
6
       jne
               .L3
7
       ret
A. 1
B. 2
C. 3
D. 4
E. 5+
```

```
for(i = 0; i < count; i++) {
    s += a[i];
}
return s;</pre>
```



# Pipeline hazards

## Three types of pipeline hazards

- Structural hazards resource conflicts cannot support simultaneous execution of instructions in the pipeline
- Control hazards the PC can be changed by an instruction in the pipeline
- Data hazards an instruction depending on a the result that's not yet generated or propagated when the instruction needs that

# Stall — the universal solution to pipeline hazards

#### Stall whenever we have a hazard

 Stall: the hardware allows the earlier instruction to proceed, all later instructions stay at the same stage

```
WB
                                EX
① xorl %eax, %eax
                                   MEM
                                       WB
@ movl (%rdi), %ecx
                                IF
                                    ID
                                        ID
                                               EX
                                            ID
                                                   WB
③ addl %ecx, %eax
                                               ID
                                                   EX
@ addq $4, %rdi
                                                       WB
© cmpq %rdx, %rdi
                                                   ID
                                                           ID
                                                                  WB
                                                       ID
                                                                  ID
                                                   IF
                                                           IF
                                                               ID
© jne .L3
```

# Slow! — 5 additional cycles

② ret

# Structural Hazards

## Dealing with the conflicts between ID/WB

- The same register cannot be read/written at the same cycle
- Better solution: write early, read late
  - Writes occur at the clock edge and complete long enough before the end of the clock cycle.
  - This leaves enough time for outputs to settle for reads
  - The revised register file is the default one from now!

#### How to with the conflicts between MEM and IF?

The memory unit can only accept/perform one request each

cycle

```
1 xorl %eax, %eax | F | D | EX | WB |
2 movl (%rdi), %ecx | F | D | MEM |
3 addl %ecx, %eax | F | D | IF | D | Instruction fetch |
4 addq $4, %rdi | F | D | IF | D | Instruction fetch |
5 addq $4, %rdi | F | D | IF | D | Instruction fetch |
6 addq $4, %rdi | F | D | IF | D | Instruction fetch |
7 addq $4, %rdi | F | D | IF | D | Instruction fetch |
8 addq $4, %rdi | F | D | MEM |
9 addq $4, %rdi | F | D | IF | D | Instruction fetch |
9 addq $4, %rdi | F | D | IF | D | Instruction fetch |
9 addq $4, %rdi | F | D | IF | D | Instruction fetch |
9 addq $4, %rdi | F | D | IF | D | Instruction fetch |
9 addq $4, %rdi | F | D | IF | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $4, %rdi | F | D | Instruction fetch |
9 addq $
```

"Split L1" cache!



**Processor** 

Core

Registers

data access

**D-L1\$** 

# Split L1-\$



#### What if the memory instruction needs more time?



#### What if the memory instruction needs more time?

 Every instruction needs to go through exactly the same number of stages

# **Blocking cache**



# Multibanks & non-blocking caches



# Pipelined access and multi-banked caches





#### Structural Hazards

- Stall can address the issue but slow
- Improve the pipeline unit design to allow parallel execution
  - Write-first, read later register files
  - Split L1-Cache
  - Force all instructions go through exactly the same number of stages

# **Control Hazards**

#### **Control Hazard**

- ① cmpq %rdx, %rdi
- ② jne .L3
- 3 ret



We cannot know if we should fetch (7) or (2) before the EX is done

#### How does the code look like?

```
for (j = 0; j < reps; ++j) {
    for (unsigned i = 0; i < size; ++i) {
        if (data[i] >= threshold)
```

data[i] < threshold

```
loop0:
.LFB0:
   .cfi_startproc
   endbr64
   pushq %rbp
   .cfi_def_cfa_offset 16
   .cfi_offset 6, −16
   movq %rsp, %rbp
   .cfi_def_cfa_register 6
   movq %rdi, -24(%rbp)
   movl %esi, -28(%rbp)
   movl \%edx, -32(\%rbp)
   movl %ecx, -36(%rbp)
   mov1 \$0, -8(\%rbp)
   movl \$0, -12(\%rbp)
         .L2
   jmp
```

```
We skip the following code block if We use "backward" branches (taking if
                                      going back) to implement loops
```

```
.L6:
   movl $0, -4(%rbp)
         .L3
   jmp
.L5:
   movl -4(%rbp), %eax
   leaq 0(,%rax,4), %rdx
  movq -24(\%rbp), \%rax
   addq %rdx, %rax
  movl (%rax), %eax
   cmpl %eax, -32(%rbp)
   jg .L4
   addl $1, -8(\%rbp)
.L4:
   addl $1, -4(%rbp)
.L3:
  movl = -28(\%rbp), %eax
```

```
cmpl %eax, -4(%rbp)
  jb .L5
  addl $1, -12(%rbp)
.L2:
  movl -12(%rbp), %eax
  cmpl -36(%rbp), %eax
  jl .L6
  movl = -8(\%rbp), \%eax
  popq %rbp
  .cfi_def_cfa 7, 8
  ret
```

# **Dynamic Branch Prediction**

# Microprocessor with a "branch predictor"



## Detail of a basic dynamic branch predictor



### 2-bit/Bimodal local predictor

- Local predictor every branch instruction has its own state
- 2-bit each state is described using 2 bits
- Change the state based on actual outcome
- If we guess right no penalty

 If we guess wrong — flush (clear pipeline registers) for mis-predicted instructions that are currently in IF and ID stages and reset the PC

branch PC target PC 5

0x400048 0x400032 10

Predict Taken 0x400080 0x400068 11

0x401080 0x401100 00

0x4000F8 0x400100 01



Not taken

Weak

Strong

### 2-bit local predictor



# Two-level global predictor

Marius Evers, Sanjay J. Patel, Robert S. Chappell, and Yale N. Patt. 1998. An analysis of correlation and predictability: what makes two-level branch predictors work. In Proceedings of the 25th annual international symposium on Computer architecture (ISCA '98).

### 2-bit local predictor

 What's the overall branch prediction (include both branches) accuracy for this nested for loop?

(assume all states sta**repeats** all the time of NT NT NT

| Λ  | ~25%  |
|----|-------|
| А. | ~25/0 |

B. ~33%

C. ~50%

D. ~67%

E. ~75%

For branch Y, almost 100%, For branch X, only 50%

| 3           | X  | _ 00 | NT | Т  |
|-------------|----|------|----|----|
|             | me | 10   | Т  | Т  |
| 3           | X  | 01   | NT | NT |
| 3           | Υ  | 11   | Т  | Т  |
| 4           | Χ  | 00   | NT | Т  |
| 4           | Υ  | 11   | Т  | Т  |
| 5           | Χ  | 01   | NT | NT |
| 5<br>5<br>6 | Υ  | 11   | Т  | Т  |
| 6           | X  | 00   | NT | Т  |
| 6           | Υ  | 11   | Т  | Т  |
|             |    |      |    |    |

branch? state prediction actual

NT

NT

NT

NT

NT

00

00

01

X

## Detail of a basic dynamic branch predictor



#### Performance of GH predictor

```
i = 0;
do {
    if( i % 2 != 0) // Branch X, taken if i % 2 == 0
        a[i] *= 2;
    a[i] += i;
} while ( ++i < 100)// Branch Y</pre>
```

Near perfect after this

| i  | branch? | GHR | state | prediction | actual |
|----|---------|-----|-------|------------|--------|
| 0  | X       | 000 | 90    | NT         | T      |
| 0  | Y       | 001 | 00    | NT         | T      |
| 1  | X       | 011 | 00    | NT         | NT     |
| 1  | Y       | 110 | 00    | NT         | T      |
| 2  | X       | 101 | 00    | NT         | T      |
| 2  | Y       | 011 | 00    | NT         | Т      |
| 3  | X       | 111 | 00    | NT         | NT     |
| 3  | Y       | 110 | 01    | NT         | Т      |
| 4  | X       | 101 | 01    | NT         | Т      |
| 4  | Y       | 011 | 01    | NT         | Т      |
| 5  | X       | 111 | 00    | NT         | NT     |
| 5  | Y       | 110 | 10    | Т          | T      |
| 6  | X       | 101 | 10    | Т          | Т      |
| 6  | Y       | 011 | 10    | Т          | Т      |
| 7  | X       | 111 | 00    | NT         | NT     |
| 7  | Y       | 110 | 11    | T          | Т      |
| 8  | X       | 101 | 11    | Т          | Т      |
| 8  | Y       | 011 | 11    | Т          | Т      |
| 9  | X       | 111 | 00    | NT         | NT     |
| 9  | Y       | 110 | 11    | T          | T      |
| 10 | X       | 101 | 11    | Т          | Т      |
| 10 | Y       | 011 | 11    | Т          | Т      |

# Hybrid predictors

#### **Tournament Predictor**



Local History Predictor

branch PC local history

| 0x400048 | 1000 |
|----------|------|
| 0x400080 | 0110 |
| 0x401080 | 1010 |
| 0x4000F8 | 0110 |

**Predict Taken** 

#### **Tournament Predictor**

- The state predicts "which predictor is better"
  - Local history
  - Global history
- The predicted predictor makes the prediction

# Perceptron

Jiménez, Daniel, and Calvin Lin. "Dynamic branch prediction with perceptrons." Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture. IEEE, 2001.

The following slides are excerpted from <a href="https://www.jilp.org/cbp/Daniel-slides.PDF">https://www.jilp.org/cbp/Daniel-slides.PDF</a> by Daniel Jiménez

#### Branch Prediction is Essentially an ML Problem

- The machine learns to predict conditional branches
- Artificial neural networks
  - Simple model of neural networks in brain cells
  - Learn to recognize and classify patterns

### **Mapping Branch Prediction to NN**

- The inputs to the perceptron are branch outcome histories
  - Just like in 2-level adaptive branch prediction
  - Can be global or local (per-branch) or both (alloyed)
  - Conceptually, branch outcomes are represented as
    - +1, for taken
    - -1, for not taken
- The output of the perceptron is
  - Non-negative, if the branch is predicted taken
  - Negative, if the branch is predicted not taken
- Ideally, each static branch is allocated its own perceptron

### Mapping Branch Prediction to NN (cont.)

- Inputs (x's) are from branch history and are -1 or +1
- n + 1 small integer weights (w's) learned by on-line training
- Output (y) is dot product of x's and w's; predict taken if y = 0
- Training finds correlations between history and outcome



### **Training Algorithm**

```
x_{1..n} is the n-bit history register, x_0 is 1.
w_{0..n} is the weights vector.
t is the Boolean branch outcome.
\theta is the training threshold.
if |y| \le \theta or ((y \ge 0) \ne t) then
     for each 0 \le i \le n in parallel
         if t = x_i then
              w_i := w_i + 1
         else
              w_i := w_i - 1
         end if
     end for
end if
```

## **Predictor Organization**



### Branch predictors in processors

- The Intel Pentium MMX, Pentium II, and Pentium III have local branch predictors with a local 4-bit history and a local pattern history table with 16 entries for each conditional jump.
- Global branch prediction is used in Intel Pentium M, Core, Core 2, and Silvermont-based Atom processors.
- Tournament predictor is used in DEC Alpha, AMD Athlon processors
- The AMD Ryzen multi-core processor's Infinity Fabric and the Samsung Exynos processor include a perceptron based neural branch predictor.

# Branch and programming

#### **Demo revisited**

```
if(option)
    std::sort(data, data + arraySize);

for (unsigned i = 0; i < 100000; ++i) {
    int threshold = std::rand();
    for (unsigned i = 0; i < arraySize; ++i) {
        if (data[i] >= threshold)
            sum ++;
    }
}
```

SELECT count(\*) FROM TABLE WHERE val < A and val >= B;

#### **Demo: Popcount**

- The population count (or popcount) of a specific value is the number of set bits (i.e., bits in 1s) in that value.
- Applications
  - Parity bits in error correction/detection code
  - Cryptography
  - Sparse matrix
  - Molecular Fingerprinting
  - Implementation of some succinct data structures like bit vectors and wavelet trees.

#### **Demo: Popcount**

• Given a 64-bit integer number, find the number of 1s in its binary representation.

• Example 1:

Input: 59487

Output: 9

Explanation: 59487's binary

representation is

Ob10110010100001111

```
int main(int argc, char *argv[]) {
     uint64_t key = 0xdeadbeef;
     int count = 1000000000;
     uint64_t sum = 0;
     for (int i=0; i < count; i++)
         sum += popcount(RandLFSR(key));
     printf("Result: %lu\n", sum);
     return sum;
```

#### Hardware acceleration

- Because popcount is important, both intel and AMD added a POPCNT instruction in their processors with SSE4.2 and SSE4a
- In C/C++, you may use the intrinsic "\_mm\_popcnt\_u64" to get # of "1"s in an unsigned 64-bit number
  - You need to compile the program with -m64 -msse4.2 flags to enable these new features

```
#include <smmintrin.h>
inline int popcount(uint64_t x) {
    int c = _mm_popcnt_u64(x);
    return c;
}
```

## Tricky C/C++ programming questions?

- Give a fastest way to multiply any number by 9 compilers can help
- How to measure the size of any variable without "sizeof" operator?.
- How to measure the size of any variable without using "sizeof" operator?
- · Write code snippets to swap two variables in five different ways other fancy ways are slower
- How to swap between first & 2nd byte of an integer in one line statement? does not help performance
- What is the efficient way to divide a no. by 4? compilers can help
- Suggest an efficient method to count the no. of 1's in a 32 bit no. Remember without using loop & testing each bit.
- Test whether a no. is power of 2 or not. programmers can help
- programmers can help

- How to check endianness of the computer.
- other fancy ways are slower • Write a C-program which does the addition of two integers without using '+' operator.
- · Write a C-program to find the smallest of three integers without using any of the comparision operators.
- Find the maximum & minimum of two numbers in a single line without using any condition & loop.
- What "condition" expression can be used so that the following code snippet will print Hello world.
- How to print number from 1 to 100 without using conditional operators.
- WAP to print 100 times "Hello" without using loop & goto statement.
- Write the equivalent expression for x%8.

https://www.emblogic.com/blog/12/tricky-c-interview-questions/