



# **CV32A6 FPGA Optimization**

Sébastien Jacq – 17/11/2022





#### CV32A6 PPA on XC7K325T-2FFG900C (Genesys2)

- Max frequency = 99 MHz
- Configuration
  - RV32IMAC, 6 stages
  - Write-through (WT) data cache: 32 KB, 8 ways of 4 KB
  - Instruction cache: 16 KB, 4 ways of 4 KB
  - MMU SV32
  - PMP
  - Dynamic branch prediction

|                   | LUT   | FF    | DSP | BRAM |
|-------------------|-------|-------|-----|------|
| csr_regfile_i     | 1422  | 958   | 0   | 0    |
| ex_stage_i        | 4524  | 2966  | 4   | 0    |
| i_cache_subsystem | 5523  | 2084  | 0   | 36   |
| i_frontend        | 1811  | 2434  | 0   | 0    |
| i_perf_counters   | 129   | 448   | 0   | 0    |
| id_stage_i        | 112   | 168   | 0   | 0    |
| issue_stage_i     | 4954  | 2424  | 0   | 0    |
| CV32A6            | 18103 | 11484 | 4   | 36   |

**OPEN** 

Before optimization



#### CV32A6 PPA on XC7K325T-2FFG900C (Genesys2) - LUT



Before optimization
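
The LUT and FF breakdown per module can be recomputed from the "before optimization" table. The sketch below is only an illustration: the dictionary and function names are hypothetical, and shares are taken relative to the sum of the listed module rows, because those rows do not add up exactly to the top-level CV32A6 row (18475 vs. 18103 LUT, 11482 vs. 11484 FF).

```python
# Per-module figures copied from the "before optimization" table.
# All names below (BEFORE, shares, largest) are illustrative only.
BEFORE = {
    "csr_regfile_i":     {"LUT": 1422, "FF": 958},
    "ex_stage_i":        {"LUT": 4524, "FF": 2966},
    "i_cache_subsystem": {"LUT": 5523, "FF": 2084},
    "i_frontend":        {"LUT": 1811, "FF": 2434},
    "i_perf_counters":   {"LUT": 129,  "FF": 448},
    "id_stage_i":        {"LUT": 112,  "FF": 168},
    "issue_stage_i":     {"LUT": 4954, "FF": 2424},
}


def module_sum(resource):
    """Sum one resource over the listed module rows."""
    return sum(row[resource] for row in BEFORE.values())


def shares(resource):
    """Return {module: percent of the summed module rows} for one resource."""
    total = module_sum(resource)
    return {name: 100.0 * row[resource] / total for name, row in BEFORE.items()}


def largest(resource):
    """Name of the module that uses the most of the given resource."""
    s = shares(resource)
    return max(s, key=s.get)


# Print the breakdown, largest consumer first.
for resource in ("LUT", "FF"):
    print(resource)
    for name, pct in sorted(shares(resource).items(), key=lambda kv: -kv[1]):
        print(f"  {name:<18} {pct:5.1f} %")
```

With these numbers, the cache subsystem, the issue stage, and the execute stage are the three largest LUT consumers.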

THALES Building a future we can all trust

REF xxxxxxxxxxx rev xxx - date Thales Research & Technology France

#### CV32A6 PPA on XC7K325T-2FFG900C (Genesys2) - FF



Before optimization




#### CV32A6 PPA on XC7K325T-2FFG900C (Genesys2)

- Max frequency = 127 MHz (+27%)
- Configuration
  - RV32IMA, 6 stages
  - Write-through (WT) data cache: 8 KB, 2 ways of 4 KB
  - Instruction cache: 8 KB, 2 ways of 4 KB
  - MMU SV32 (optional)
  - Dynamic branch prediction
  - 2-level TLB
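
Both configurations keep every cache way at 4 KB and change only the number of ways. The short sketch below checks that arithmetic for the two configurations listed on the slides. The names are hypothetical. The 4 KB figure also happens to equal the Sv32 base page size; whether the design depends on that match is not stated in the slides, so it is only noted here as an observation.

```python
KIB = 1024
SV32_PAGE_BYTES = 4 * KIB  # Sv32 base page size (RISC-V privileged spec)

# (total size in bytes, number of ways) as listed on the configuration slides.
# "before"/"after" and all identifiers here are illustrative labels.
CONFIGS = {
    "before": {"dcache": (32 * KIB, 8), "icache": (16 * KIB, 4)},
    "after":  {"dcache": (8 * KIB, 2),  "icache": (8 * KIB, 2)},
}


def way_bytes(total_bytes, ways):
    """Size of one way; the total must divide evenly across the ways."""
    if ways <= 0 or total_bytes % ways:
        raise ValueError("cache size is not a whole multiple of the way count")
    return total_bytes // ways


for label, caches in CONFIGS.items():
    for name, (total, ways) in caches.items():
        size = way_bytes(total, ways)
        same = "yes" if size == SV32_PAGE_BYTES else "no"
        print(f"{label:<6} {name}: {ways} x {size // KIB} KB "
              f"(way size equals Sv32 page: {same})")
```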

|                   | LUT  | FF   | DSP | BRAM36 | BRAM18 |
|-------------------|------|------|-----|--------|--------|
| csr_regfile_i     | 218  | 646  | 0   | 0      | 0      |
| ex_stage_i        | 3083 | 1732 | 4   | 0      | 5      |
| i_cache_subsystem | 2038 | 848  | 0   | 12     | 0      |
| i_frontend        | 477  | 213  | 0   | 0      | 1      |
| i_perf_counters   | 0    | 0    | 0   | 0      | 0      |
| id_stage_i        | 245  | 167  | 0   | 0      | 0      |
| issue_stage_i     | 1963 | 797  | 0   | 0      | 0      |
| CV32A6            | 8077 | 4403 | 4   | 12     | 5      |

Current state, pull requests ongoing
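
The top-level rows of the two tables can be compared directly to quantify the savings. The following is a minimal sketch under two assumptions that the slides do not state: the "BRAM" column of the first table counts 36 Kb blocks, and one BRAM18 is counted as half a BRAM36 (on 7-series devices a 36 Kb block can be split into two 18 Kb blocks). All identifiers are hypothetical.

```python
# Top-level CV32A6 rows copied from the two resource tables.
# Assumption: the single "BRAM" column of the first table means BRAM36 blocks.
BEFORE = {"LUT": 18103, "FF": 11484, "DSP": 4, "BRAM36": 36, "BRAM18": 0}
AFTER = {"LUT": 8077, "FF": 4403, "DSP": 4, "BRAM36": 12, "BRAM18": 5}


def reduction_pct(before, after):
    """Relative saving in percent (positive means fewer resources used)."""
    return 100.0 * (before - after) / before


def bram36_equiv(row):
    """BRAM usage in BRAM36 equivalents, counting a BRAM18 as half a BRAM36."""
    return row["BRAM36"] + 0.5 * row["BRAM18"]


for resource in ("LUT", "FF"):
    pct = reduction_pct(BEFORE[resource], AFTER[resource])
    print(f"{resource}: {BEFORE[resource]} -> {AFTER[resource]} ({pct:.1f} % fewer)")

bram_pct = reduction_pct(bram36_equiv(BEFORE), bram36_equiv(AFTER))
print(f"BRAM36 equivalents: {bram36_equiv(BEFORE)} -> {bram36_equiv(AFTER)} "
      f"({bram_pct:.1f} % fewer)")
```

Part of the saving comes from the smaller cache configuration and the removed options (C extension, PMP), so the percentages describe the two configurations as measured, not the effect of any single change.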



#### **CV32A6 LUT optimization**





#### **Further optimizations**

After resource optimization, focus on frequency

#### Ongoing

- Submitting pull requests for resource optimizations
- Breaking critical paths
  - The current interconnect seems to be the limit; 140 MHz looks achievable by the core
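
To put the frequency targets in timing terms, the sketch below converts the three frequencies quoted in this deck into clock periods and shows how much delay would have to be removed from the critical path to move from one to the next. It is a simplified illustration with hypothetical names: it ignores clock uncertainty and other constraint margins that a real timing report would include.

```python
# Frequencies quoted in the slides, in MHz. Labels are illustrative.
FREQS_MHZ = {"before": 99.0, "current": 127.0, "target": 140.0}


def period_ns(f_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / f_mhz


def path_cut_ns(f_from_mhz, f_to_mhz):
    """Delay to remove from the critical path to go from one frequency to another."""
    return period_ns(f_from_mhz) - period_ns(f_to_mhz)


for label, f in FREQS_MHZ.items():
    print(f"{label:<8} {f:6.1f} MHz -> period {period_ns(f):.3f} ns")

cut = path_cut_ns(FREQS_MHZ["current"], FREQS_MHZ["target"])
print(f"current -> target needs about {cut:.3f} ns less on the critical path")
```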


#### Next steps

- Optionally replace the caches with tightly coupled memory (TCM) for applications with a small memory footprint
- Leaner scoreboard microarchitecture
- Considered: use LUTRAM for CSR registers



# Thank you


