

# Digital System Design

# Final Project Hardware Implementation of Pipelined RISC-V

Speaker: Jackson

Instructor: 吳安宇教授

Date: 2025/05/15



#### **RISC-V Processors**

- Single-Cycle
  - Simple design with low throughput



- Pipelined
  - Higher throughput
  - Complex design
    - For handle hazard







#### Final Project





- + HW2: Single Cycle RISC-V
- Final Project: Pipelined RISC-V Processor
  - With Instruction cache and data cache





#### <u>Pipeline</u>



- Splits into several functional unit and perform instruction in parallel way
- Your design should be based on this 5-stage pipelined structure
  - ❖ Pipeline stage >= 5



#### Required Instruction Set

You need to modify several parts to fit our specifications.

For example, you need to add the path for J-type instructions

Table 1. Required Instruction Set

| Name | Description                                                         |  |  |  |
|------|---------------------------------------------------------------------|--|--|--|
| ADD  | Addition, overflow detection for signed operand is not required*    |  |  |  |
| ADDI | Addition immediate with sign-extension, without overflow detection* |  |  |  |
| SUB  | Subtract, overflow detection for signed operand is not required*    |  |  |  |
| AND  | Boolean logic operation                                             |  |  |  |
| ANDI | Boolean logic operation with 12bit of immediate                     |  |  |  |
| OR   | Boolean logic operation                                             |  |  |  |
| ORI  | Boolean logic operation with 12bit of immediate                     |  |  |  |
| XOR  | Boolean logic operation                                             |  |  |  |
| XORI | Boolean logic operation with 12bit of immediate                     |  |  |  |
| SLLI | Shift left logical (zero padding)                                   |  |  |  |
| SRAI | Shift right arithmetic (sign-digit padding)                         |  |  |  |
| SRLI | Shift right logical (zero padding)                                  |  |  |  |
| SLT  | Set less than, comparison instruction                               |  |  |  |
| SLTI | Set less than variable, comparison instruction                      |  |  |  |
| BEQ  | Branch on equal, conditional branch instruction                     |  |  |  |
| BNE  | Branch on not equal, conditional branch instruction                 |  |  |  |
| JAL  | Unconditionally jump and link (Save next PC in \$rd)                |  |  |  |
| JALR | Jump and link register(Save next PC in \$rd)                        |  |  |  |
| LW   | Load word from data memory (assign word-aligned)                    |  |  |  |
| SW   | Store word to data memory (assign word-aligned)                     |  |  |  |
| NOP  | No operation(addi \$r0 \$r0 0)                                      |  |  |  |

Different from definition in [1], the exception handler for arithmetic overflow is not required.



#### **Hazards**

- In order to implement pipelined-RISC-V, you need to resolving hazard in hardware
  - Structural Hazard



- the hardware cannot support the combination of instructions
- Data Hazard
  - ➤ data that is needed to execute the instruction is not yet available (dynamic).
- Load-use Data Hazard
  - ➤ load instruction (lw) (read data from main memory) has not yet become available when it is requested
- Control hazard (Branch hazard)
  - instruction that was fetched is NOT the one that is needed





#### **How's the test works?**



- I\_mem: Instruction Memory
- D\_mem: Data memory



# **EXTENSIONS**



#### **Extensions**

- Branch Prediction(+define+BrPred)
  - Always not branch
  - Branch / Not branch / Branch
  - Always branch
- Supporting compressed instructions (+define+compression(\_uncompressed))
  - Extract C-Instructions
- Supporting multiplication instructions (+define+Mul)
  - Implement multiplication unit



#### **Branch Prediction**

- For loop is commonly see in programming
  - ❖for(i=0;i<k;i++) → they did BEQ in every loop</p>
- To increase overall efficiency of system, branch prediction is always used

```
ADD $t0, $0, $0
                    #set $t0 to 0
                                        for(int i=0; i != 100; i++) {...}
ADDI$t1, $0, 100
                    #set $t1 to 100
                                        for(int i=0; i!= 100; i++) {...}
SLL...
                                        for(int i=0; i!= 100; i++) \{...\}
                    #loop body
ADD...
                    #assume this loop contains 3 instructions
SUB...
ADDI$t0, $t0, 1
                    #add 1 to $t0
                                             for(int i=0; i!= 100; i++) \{...\}
                    #branch if $t0 != $t1
BNE $t0, $t1, -5
                                             for(int i=0; i != 100; i++) {...}
```

With branch prediction → waste 3 cycles

Waste 99 cycles





#### **Compressed Instruction**

Implement the following 16 C-instructions as the extension to your base RISC-V core

| C.ADD  | C.ANDI | C.LW   | C.J    |
|--------|--------|--------|--------|
| C.MV   | C.SLLI | C.SW   | C.JAL  |
| C.ADDI | C.SRLI | C.BEQZ | C.JR   |
| C.NOP  | C.SRAI | C.BNEZ | C.JALR |

- Extract information encoded in C-instructions
- PC increment
- Address alignment issued





#### <u>Multiplication Instruction</u>

Implement the following Multiplication Instruction to your base RISC-V core

| 1 | R٦ | I3 | 2 | M | S | tand | lard | E: | vte     | nsion  |  |
|---|----|----|---|---|---|------|------|----|---------|--------|--|
|   |    |    |   |   |   |      |      |    | A 10 To | 110101 |  |

|         |     |     |     |                  |         | 4   |
|---------|-----|-----|-----|------------------|---------|-----|
| 0000001 | rs2 | rs1 | 000 | $^{\mathrm{rd}}$ | 0110011 | MUL |

- Design of multiplication unit
- Timing optimization
  - Pipeline/Parallelization
- Hazard issues (if any)





#### Simulation with Final\_tb

```
source /usr/cad/synopsys/CIC/vcs.cshrc
source /usr/spring_soft/CIC/verdi.cshrc

// RTL
vcs Final_tb.v rtl_files.v slow_memory.v -full64 -R -debug_access+all +v2k
+define+noHazard

// Gate level
vcs Final_tb.v CHIP_syn.v slow_memory.v -v tsmc13.v -full64 -R
-debug_access+all +v2k +define+noHazard +define+SDF
```

```
`ifdef noHazard

`define IMEM_INIT "../Baseli

`include "../Baseline/TestBe
 `define DMEM_INIT "../Baseli

`endif

`ifdef hasHazard

`define IMEM_INIT "../Baseli

`include "../Baseline/TestBe
 `define DMEM_INIT "../Baseli

`endif

`ifdef BrPred
 `define IMEM_INIT "../Extens
 `include "../Extension/BrPre
 `define DMEM_INIT "../Baseli

`endif
```

- Final\_tb.v in src/
- Change +noHazard for testing different cases
  - +hasHazard
  - +BrPred/+compression(\_uncompressed)/+Mul
  - +QSort(\_uncompressed)/+Conv(\_uncompressed)



# **Grading Policy**

- The score of this project consists of two aspects:
  - 1) Baseline checkpoint (40%)
    - Checkpoint Presentation (5%)
    - noHazard Gate Level (15%)
    - hasHazard Gate Level (20%)
  - 2) Final presentation and submission (60%)
    - > Final Presentation (8%)
    - Branch Prediction Gate Level (9%)
    - Compressed Instr. Gate Level (9%)
    - Multiplication Instr. Gate Level (9%)
    - Q\_sort & Conv AT Ranking (15%)
    - > Report (10%)



### **Baseline Checkpoint (40%)**

- Checkpoint
  - 4-6 pages slides (about 5 minutes)
  - Confirm your current results and future plan
- Pass test programs (noHazard, hasHazard)
  - Supporting all instructions above
  - With caches
  - Complete the circuit synthesis. Note that the slack cannot be negative.
  - ❖ Your cycle time must below 3.5ns



# Final Presentation and Submission (60%)

- Each team should prepare a full talk (about 10-15 slides within 10 minutes) for your fantastic work on extension!
- Implement as much and deep as you can of the topics of extension
  - ❖ Deeper exploration → higher score
  - The content also affect quality of presentation and report
- The AT performance is evaluated by Q\_sort and Conv
  - Area (um2)
    - × Simulation time of Q\_sort (ns)
    - × Simulation time of Conv (ns)



# **XNotice**

- 1. Latches are not allowed in gate level code after synthesis, use Flip-flop instead.
- 2. Negative Slack and Timing Violations are not allowed after synthesis.
- 3. There will be hidden cases for all test patterns.
- 4. Cycle time in Gate-level simulation cannot be lower than the cycle time in SDC file.
- 5. You should not assume memory size is 1024 x 128 bit.



# **XNotice**

- 6. PC address should be reset to 0.
- 7. You should identify suitable input / output delay based on the testbench.
- 8. The behavior of slow memory is different from the memory in HW3.





# **Schedule**

| Date | Submission/Event                                                      |
|------|-----------------------------------------------------------------------|
| 5/15 | Final project announcement                                            |
| 5/29 | A. Baseline checkpoint. Each team should prepare a presentation (4-   |
|      | 6 pages, about 5 minutes) to confirm your current results and         |
|      | future plan.                                                          |
|      | B. You should upload the source codes and slide to COOL before        |
|      | 5/28 23:59.                                                           |
|      | C. Briefly show your design for passing Baseline testbenches          |
|      | (noHazard, hasHazard).                                                |
|      | D. Extension topics plan should be included in the presentation       |
|      | E. Please attach the work assignment chart on last page               |
| 6/12 | A. Final presentation. Each team should prepare a full talk (about    |
|      | 10-15 slides within 10 minutes) to demonstrate your fantastic         |
|      | work! Detail presentation plan will be announced on COOL.             |
|      | B. You should upload the source codes (including all the RTL code     |
|      | and synthesis related files: *.v, *.sdf, *.ddc and a Readme.txt)      |
|      | to COOL before 6/11 23:59.                                            |
| 6/16 | Report submission. You should upload the report (including a          |
|      | detailed report (8-16 pages) and the presentation slides). You should |
|      | upload them to COOL before 6/16 23:59.                                |