Akshat1508/SimpleRISC-ALU


SimpleRISC ALU — High-Performance 32-bit ALU Design

A fully synthesizable, high-performance Arithmetic Logic Unit (ALU) designed for the SimpleRISC single-cycle processor core, implemented in Verilog RTL. The design targets a 250 MHz clock frequency and supports all arithmetic, logical, shift, and comparison operations defined by the SimpleRISC ISA.


Overview

This project replaces the placeholder ALU in the given SimpleRISC RTL core with a performance-optimized implementation. All functional units are combinational and evaluate in parallel; a final MUX selects the result based on the 4-bit op control signal. The critical path is therefore the slowest single unit plus the output MUX rather than a chain of units, which is what the 250 MHz (4 ns period) timing target relies on.

Inputs:

  • a[31:0] — Operand A
  • b[31:0] — Operand B
  • op[3:0] — Operation select

Outputs:

  • y[31:0] — Result
  • zero — Flag: asserted when y == 0

ALU Operation Set

| Opcode | Mnemonic | Description |
|--------|----------|-------------|
| 0000 | ADD | Signed/unsigned addition |
| 0001 | SUB | Signed/unsigned subtraction |
| 0010 | AND | Bitwise AND |
| 0011 | OR | Bitwise OR |
| 0100 | XOR | Bitwise XOR |
| 0101 | SLT | Set to 1 if A < B (signed), else 0 |
| 0110 | SLL | Logical shift left |
| 0111 | SRL | Logical shift right |
| 1000 | SRA | Arithmetic shift right |
| 1001 | PASS | Pass operand B through |
| 1010 | NOT | Bitwise NOT of operand B |
| 1011 | MUL | Signed 32×32 multiplication |
| 1100 | DIV | Signed division (quotient) |
| 1101 | MOD | Signed modulus (remainder) |
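The opcode table can be captured as a small behavioural reference model, useful as a golden model when checking simulation output. The sketch below is not taken from the repository: the names `alu_ref`, `to_signed`, and `trunc_div` are hypothetical, and it assumes that shifts act on operand A by `b[4:0]`, that DIV/MOD truncate toward zero (consistent with the expected values in the Test Cases section), and that division by zero yields 0.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def trunc_div(n, d):
    """Signed division rounding toward zero (Python's // rounds toward -inf)."""
    q = abs(n) // abs(d)
    return -q if (n < 0) != (d < 0) else q

def alu_ref(a, b, op):
    """Return (y, zero) for one operation; a, b may be given as signed ints."""
    a &= MASK32
    b &= MASK32
    sa, sb = to_signed(a), to_signed(b)
    shamt = b & 0x1F                      # assumption: shift amount is b[4:0]
    quo = 0 if sb == 0 else trunc_div(sa, sb)
    # Every result is computed up front, then one is selected,
    # loosely mirroring "parallel units + final mux".
    results = {
        0b0000: a + b,                    # ADD
        0b0001: a - b,                    # SUB
        0b0010: a & b,                    # AND
        0b0011: a | b,                    # OR
        0b0100: a ^ b,                    # XOR
        0b0101: int(sa < sb),             # SLT (signed)
        0b0110: a << shamt,               # SLL
        0b0111: a >> shamt,               # SRL (a is already unsigned here)
        0b1000: sa >> shamt,              # SRA (Python >> on a negative int sign-extends)
        0b1001: b,                        # PASS
        0b1010: ~b,                       # NOT
        0b1011: sa * sb,                  # MUL (low 32 bits kept below)
        0b1100: quo,                      # DIV
        0b1101: 0 if sb == 0 else sa - sb * quo,  # MOD
    }
    y = results[op] & MASK32              # KeyError for unassigned opcodes
    return y, int(y == 0)
```

For example, `alu_ref(-103, 10, 0b1100)` returns the 32-bit pattern for -10 with the zero flag clear.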

Architecture

All functional units are instantiated simultaneously and compute results in parallel. The top-level ALU mux selects among them based on op.

Kogge-Stone Parallel Prefix Adder

File: rtl/fast_adders.v

The add/subtract unit uses a 32-bit Kogge-Stone adder (fast_ksa32). This is a parallel prefix adder that computes all carry signals in O(log₂N) logic stages (5 prefix levels for N = 32), giving a shorter carry path than a ripple-carry design, whose delay grows as O(N), and typically fewer logic levels than a block carry-lookahead design.

  • Subtraction is performed via 2's complement: b is inverted and cin=1 is asserted when op == SUB.
  • A 64-bit variant (fast_ksa64) is also included for use inside the multiplier's final accumulation stage.
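To make the prefix structure concrete, here is a bit-level Python model of a Kogge-Stone adder. It is an illustrative sketch, not a transcription of fast_adders.v: the function name `kogge_stone_add` is hypothetical, and folding the carry-in into the bit-0 generate term is one common choice that the RTL may or may not share.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1

def kogge_stone_add(a, b, cin=0):
    """Return (sum, carry_out) of two WIDTH-bit patterns using prefix carries."""
    a &= MASK
    b &= MASK
    # Per-bit generate and propagate signals.
    g = [(a >> i) & (b >> i) & 1 for i in range(WIDTH)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(WIDTH)]
    # Group signals; the carry-in is absorbed into bit 0.
    G = g[:]
    P = p[:]
    G[0] = g[0] | (p[0] & cin)
    # log2(WIDTH) prefix levels with spans 1, 2, 4, 8, 16.
    dist = 1
    while dist < WIDTH:
        new_G, new_P = G[:], P[:]
        for i in range(dist, WIDTH):
            new_G[i] = G[i] | (P[i] & G[i - dist])
            new_P[i] = P[i] & P[i - dist]
        G, P = new_G, new_P
        dist *= 2
    # After the last level G[i] is the carry out of bit i.
    carries = [cin] + G[:-1]
    total = 0
    for i in range(WIDTH):
        total |= (p[i] ^ carries[i]) << i
    return total, G[WIDTH - 1]

def kogge_stone_sub(a, b):
    """a - b via two's complement: invert b and assert cin = 1."""
    return kogge_stone_add(a, ~b & MASK, 1)
```

The `while` loop runs five times for a 32-bit width, which corresponds to the O(log₂N) depth described above.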

Radix-4 Booth Multiplier with Dadda Tree

Files: rtl/multiplier.v, rtl/encoder.v, rtl/tree_reducer.v

The multiplier (comb_mult32x32) is purely combinational and built from three cascaded stages:

  1. Radix-4 Booth Encoding (radix4_booth_encoder): Encodes the 32-bit multiplier into 16 partial products (each 64 bits wide). This halves the number of partial products compared with the 32 rows of a bit-by-bit shift-and-add array, reducing tree height.

  2. Dadda Tree Reduction (dadda_tree_reducer): Compresses the 16 partial products into two 64-bit operands (sum + carry) using cascaded 3:2 carry-save adders (CSAs). CSAs eliminate carry propagation during compression.

  3. Final Kogge-Stone Addition (kogge_stone_adder_64bit): Adds the two 64-bit outputs from the Dadda tree using the fast KS adder, producing the 64-bit signed product. The lower 32 bits are returned as the ALU result.
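The three stages can be modelled in Python as follows. This is a sketch under stated assumptions rather than a description of the RTL: all function names are hypothetical, the reduction is a simple word-level pass of 3:2 carry-save adders (it reaches two rows in the same six levels a Dadda tree needs for 16 rows, but it is not a column-level Dadda schedule), and the final addition uses Python's `+` in place of the 64-bit Kogge-Stone adder.

```python
MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1

def to_signed32(x):
    x &= MASK32
    return x - (1 << 32) if x >> 31 else x

def booth_partial_products(a, b):
    """Radix-4 Booth: recode multiplier b into 16 digits in {-2..2}."""
    sa = to_signed32(a)
    b &= MASK32
    rows = []
    prev = 0                                  # implicit bit b[-1] = 0
    for i in range(16):
        b0 = (b >> (2 * i)) & 1
        b1 = (b >> (2 * i + 1)) & 1
        digit = b0 + prev - 2 * b1            # Booth digit for this bit triplet
        rows.append(((digit * sa) << (2 * i)) & MASK64)  # 64-bit two's complement row
        prev = b1
    return rows

def csa(x, y, z):
    """3:2 carry-save adder: x + y + z == s + c (mod 2**64), no carry ripple."""
    s = x ^ y ^ z
    c = (((x & y) | (x & z) | (y & z)) << 1) & MASK64
    return s, c

def reduce_to_two(rows):
    """Compress rows three at a time until only a sum row and a carry row remain."""
    rows = list(rows)
    while len(rows) > 2:
        full = len(rows) // 3 * 3
        nxt = []
        for k in range(0, full, 3):
            nxt.extend(csa(rows[k], rows[k + 1], rows[k + 2]))
        nxt.extend(rows[full:])               # leftover rows pass through unchanged
        rows = nxt
    return rows

def mul32(a, b):
    """Return (low 32 bits, full 64-bit product pattern) of signed a * b."""
    s, c = reduce_to_two(booth_partial_products(a, b))
    product = (s + c) & MASK64                # stand-in for the final 64-bit adder
    return product & MASK32, product
```

Only the final `s + c` involves carry propagation; every `csa` call is a bitwise operation, which is why the compression stage stays shallow.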

Non-Restoring Signed Divider

File: rtl/divider.v

The divider (signed_divider32) uses a non-restoring division algorithm operating on absolute values:

  1. Sign bits are extracted and the result sign is computed (num_sign XOR den_sign).
  2. The non-restoring loop iterates 32 times, conditionally adding or subtracting the divisor based on the sign of the running remainder.
  3. A correction step restores the remainder if it ends negative.
  4. Final sign correction is applied to both quotient and remainder.
  5. Division by zero returns 0 for both outputs.

Note: The iterative loop unrolls fully in combinational synthesis, making this a single-cycle operation at the cost of area.
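The five steps above can be sketched as a Python model. It is illustrative only: `nonrestoring_divide` is a hypothetical name, Python's unbounded integers stand in for the widened remainder register, and giving the remainder the sign of the numerator is an assumption inferred from the expected value -103 mod 10 = -3 in the Test Cases section.

```python
MASK32 = 0xFFFFFFFF

def nonrestoring_divide(num, den):
    """Return (quotient, remainder) as 32-bit patterns for signed num / den."""
    num &= MASK32
    den &= MASK32
    if den == 0:
        return 0, 0                       # step 5: division by zero returns 0, 0
    # Step 1: extract signs and work on magnitudes.
    num_sign = num >> 31
    den_sign = den >> 31
    n = (-num) & MASK32 if num_sign else num
    d = (-den) & MASK32 if den_sign else den
    # Step 2: 32 iterations; add or subtract depending on the remainder's sign.
    r = 0
    q = 0
    for i in range(31, -1, -1):
        bit = (n >> i) & 1
        if r >= 0:
            r = 2 * r + bit - d
        else:
            r = 2 * r + bit + d
        q = (q << 1) | (1 if r >= 0 else 0)
    # Step 3: a single correction if the final remainder is negative.
    if r < 0:
        r += d
    # Step 4: sign correction (remainder sign follows the numerator: assumption).
    if num_sign ^ den_sign:
        q = -q
    if num_sign:
        r = -r
    return q & MASK32, r & MASK32
```

In hardware the `for` loop becomes 32 chained add/subtract rows, which is the source of the long DIV/MOD critical path.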

Barrel Shifter

File: rtl/shifter.v

The barrel shifter (shifter32_opt) supports all three shift modes via a single unified 5-stage mux chain:

  • SLL (Logical Left Shift): Input bits are reversed before the shift stages, then reversed again at output — converting a right-shift network into a left-shift with zero hardware duplication.
  • SRL (Logical Right Shift): Standard right shift, filling with 0.
  • SRA (Arithmetic Right Shift): Right shift filling with the sign bit (input[31]).

Each of the 5 stages conditionally shifts by 1, 2, 4, 8, or 16 positions based on each bit of shift_amt[4:0].
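The bit-reversal trick and the five-stage structure can be checked with a short Python model. The names `reverse32` and `barrel_shift` and the string-valued `mode` argument are hypothetical; the RTL's actual port names and mode encoding are not shown in this README.

```python
MASK32 = 0xFFFFFFFF

def reverse32(x):
    """Reverse the bit order of a 32-bit pattern."""
    return int(f"{x & MASK32:032b}"[::-1], 2)

def barrel_shift(x, shamt, mode):
    """Shift x by shamt[4:0]; mode is 'SLL', 'SRL', or 'SRA'."""
    x &= MASK32
    left = mode == "SLL"
    fill = (x >> 31) & 1 if mode == "SRA" else 0   # sign fill only for SRA
    v = reverse32(x) if left else x                # SLL reuses the right-shift network
    for stage in range(5):                         # stages shift by 1, 2, 4, 8, 16
        if (shamt >> stage) & 1:
            k = 1 << stage
            fill_bits = ((1 << k) - 1) << (32 - k) if fill else 0
            v = (v >> k) | fill_bits
    return reverse32(v) if left else v
```

A left shift by k equals reversing, shifting right by k with zero fill, and reversing again, so only one direction of shift logic is needed.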

SLT Unit

File: rtl/slt.v

The Set-Less-Than unit (slt32_opt) computes signed comparison by subtracting B from A using a Kogge-Stone adder and examining the result sign bit, with overflow detection to handle mixed-sign edge cases correctly:

overflow = (sign_a != sign_b) AND (sign_diff == sign_b)
result   = sign_diff XOR overflow
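The two equations translate directly into Python, which makes the mixed-sign corner cases easy to check. The function name `slt` is hypothetical, and the subtraction uses Python's `-` in place of the Kogge-Stone adder.

```python
MASK32 = 0xFFFFFFFF

def slt(a, b):
    """Return 1 if a < b as signed 32-bit values, else 0."""
    a &= MASK32
    b &= MASK32
    diff = (a - b) & MASK32
    sign_a, sign_b, sign_diff = a >> 31, b >> 31, diff >> 31
    # Signed overflow of a - b: operands differ in sign and the
    # difference has taken the sign of b instead of the sign of a.
    overflow = int(sign_a != sign_b and sign_diff == sign_b)
    return sign_diff ^ overflow
```

The case that the overflow term exists for is, for instance, the most negative value compared against the most positive one: the raw difference wraps to a positive pattern, and the XOR flips the result back to 1.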

Project Structure

SimpleRISC-ALU/
├── rtl/                        # Synthesizable RTL source files
│   ├── alu.v                   # Top-level ALU (integrates all units)
│   ├── fast_adders.v           # 32-bit and 64-bit Kogge-Stone adders + CSA
│   ├── multiplier.v            # 32×32 Booth multiplier top-level
│   ├── encoder.v               # Radix-4 Booth encoder (16 partial products)
│   ├── tree_reducer.v          # Dadda tree partial product reducer
│   ├── divider.v               # 32-bit signed non-restoring divider
│   ├── shifter.v               # 32-bit barrel shifter (SLL/SRL/SRA)
│   ├── slt.v                   # Signed set-less-than comparator
│   ├── simplerisc_top.v        # SimpleRISC processor top-level
│   ├── control_unit.v          # Instruction decoder / control logic
│   ├── regfile.v               # 32×32 register file
│   ├── imem.v                  # Instruction memory
│   ├── immu.v                  # Instruction memory management unit
│   └── decode.vh               # Decode macros / opcode definitions
├── tb/                         # Testbenches
│   ├── tb_alu.v                # Standalone ALU testbench
│   └── tb_simplerisc.v         # Full SimpleRISC core testbench
├── tools/
│   └── asm.py                  # SimpleRISC assembler (Python)
├── docs/
│   └── COA2_Design_Report.docx # Full design report
├── program.asm                 # Assembly test program
├── program.hex                 # Assembled hex for instruction memory
├── Makefile                    # Build and simulation targets
└── .gitignore

Getting Started

Prerequisites

  • Icarus Verilog (iverilog, vvp) for simulation
  • GTKWave for waveform viewing (optional)
  • Python 3.x for the assembler

Install on Ubuntu/Debian:

sudo apt install iverilog gtkwave python3

Running Simulation

# Compile and run full SimpleRISC simulation
make run

# Build only (no run)
make build

# View waveform (requires GTKWave)
make wave

# Clean build artifacts
make clean

To run the standalone ALU testbench:

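# If alu.v does not `include its submodules, also list the other ALU sources
# (rtl/fast_adders.v, rtl/multiplier.v, rtl/encoder.v, rtl/tree_reducer.v,
#  rtl/divider.v, rtl/shifter.v, rtl/slt.v) after rtl/alu.v.
iverilog -g2012 -o alu_tb.vvp tb/tb_alu.v rtl/alu.v && vvp alu_tb.vvp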

Assembling a Program

Use the provided Python assembler to convert a .asm source file into a .hex image:

python3 tools/asm.py program.asm program.hex

The resulting program.hex is loaded into instruction memory by the testbench.


Test Cases

The program.asm file includes the following test operations, covering all major ALU paths:

| Instruction | Operation | Expected Behaviour |
|-------------|-----------|--------------------|
| div r4, r1, r2 | DIV | -103 / 10 = -10 (signed quotient) |
| mod r5, r1, r2 | MOD | -103 mod 10 = -3 (signed remainder) |
| mul r6, r1, r2 | MUL | -103 × 10 = -1030 |
| add r7, r1, r2 | ADD | -103 + 10 = -93 |
| sub r8, r1, r2 | SUB | -103 - 10 = -113 |
| asr r9, r3, #4 | SRA | -1 >> 4 = -1 (arithmetic, sign-extended) |
| lsr r10, r3, #4 | SRL | -1 (0xFFFFFFFF) >> 4 = 0x0FFFFFFF |
| and r11, r1, r2 | AND | Bitwise AND of -103 and 10 |
| or r12, r1, r2 | OR | Bitwise OR of -103 and 10 |
| not r13, r1 | NOT | Bitwise NOT of -103 |

Additional custom test cases to add (minimum 5 required per assignment spec):

  1. SLT with equal operands: slt r0, r5, r5 → expect r0 = 0
  2. SLT with negative < positive: slt r0, r1, r2 (−103 < 10) → expect r0 = 1
  3. XOR self-clear: xor r0, r3, r3 → expect r0 = 0, zero = 1
  4. MUL overflow check: multiply two large values, verify lower 32 bits only
  5. DIV by zero: divisor = 0, expect quotient = 0 (safe default behaviour)

Design Trade-offs

| Design Choice | Benefit | Cost |
|---------------|---------|------|
| Kogge-Stone adder (O(log N) depth) | Minimal carry chain delay; supports the 250 MHz target | More prefix cells and denser wiring |
| Radix-4 Booth encoding | Halves partial products (32→16) | Encoder logic overhead |
| Dadda tree vs Wallace tree | Slightly lower gate count | Marginally more complex routing |
| Fully unrolled combinational divider | Single-cycle latency | Large area; long critical path for DIV/MOD |
| Parallel instantiation of all units | Clean mux-select structure; no unit waits on another | All units switch on every operation, so dynamic power is higher |

The divider has the longest critical path of any functional unit. If 250 MHz is not met post-synthesis, consider pipelining the divider (2–4 stages) while keeping all other units single-cycle.


Computer Organization and Architecture — Assignment 2
