# Accelerating the Viterbi Algorithm

## 1. Project Overview

Designed a hardware accelerator for the Viterbi and forward algorithms on a Hidden Markov Model (HMM), integrated via Verilog, tested in Icarus Verilog and Cocotb, and prepared for ASIC synthesis with OpenLane 2.

**Why Viterbi?** Decodes the most probable hidden-state sequence in an HMM (e.g. weather inference, speech recognition). The core "add-compare-select" (ACS) is highly regular and maps naturally to a systolic array



Viterbi Model

### 2. Goals & Success Criteria

- **Benchmark SW implementation** to identify bottlenecks using Python profiling fileciteturn1file4.
- Design and verify a Verilog-based Viterbi PE (viterbi\_pe.sv) and top module (viterbi\_top.sv) with a SystemVerilog testbench.
- **Automate functional verification** via Cocotb tests (test\_viterbi\_cocotb.py) and pytest runner (test\_viterbi\_runner.py).
- **Synthesize to ASIC** with OpenLane 2 using config.json, constraints.sdc, and pin order.cfg.

## 3. Software Profiling & Bottleneck Analysis

- 1. Initial profiling with cProfile on hmm\_demo\_profiling.py revealed:
  - Sequence generation (~10 s)

### 94000050 function calls in 308.069 seconds

Forward algorithm (log-sum-exp) (~7.6 s)

## 156.332 hmm\_demo\_profiling\_progress.py:35(forward\_log)

Viterbi (max-plus DP) (~4.0 s)

## 69.145 hmm\_demo\_profiling\_progress.py:51(viterbi)

**Prompt**: "Profile my Python HMM code and tell me the hotspots."

- 2. Detailed run on hmm demo profiling progress.py (N = 1 000 000) showed:
  - forward\_log: 38 s total (15 s loop + 23 s NumPy reductions)
  - o viterbi: 20 s total (13.8 s loop + 6.2 s overhead) fileciteturn1file4.
- 3. Takeaway: implement a systolic array for Viterbi's Add-Compare-Select (ACS) and a reduction tree for log-sum-exp in hardware.

## 4. Verilog Accelerator Design

- PE design (viterbi\_pe.sv): combinational best-of-I search and register pipeline.
- Top module (viterbi\_top.sv): FSM with INIT, FORWARD (two-stage), BACKTRACK, and DONE states.

## 5. Key Steps & Milestones

- 1. Python Profiling
  - Used cProfile on hmm\_demo\_profiling.py to measure: forward\_log → ~38 s total (15 s Python loop + 23 s NumPy reductions)
  - o viterbi  $\rightarrow$  ~20 s total (13.8 s loop + 6.2 s NumPy overhead.

<u>Prompt</u>: "Profile my forward\_log and viterbi functions, show top-10 hotspots." PE Design (viterbi\_pe.sv)

Combinational search for max over I inputs + add emission logB; registered outputs.
Prompt: "Generate synthesizable Verilog for a Viterbi PE that finds max-plus and emits ψ pointer."

Top-Level FSM (viterbi\_top.sv)

- Two-stage pipeline: INIT0 → FORWARD (compute δ and ψ each cycle) → BACKTRACK → DONE.
- Discovered I/O-timing bug ("stuck at x" path outputs) → added explicit FORWARD\_WAIT stage to register δ updates.

Prompt: "Why do I see xxxx in path[0..N-2]? Check my delta\_prev vs. delta\_next handshake."

## 6. Simulation & Verification

- Icarus + \$display testbench → preliminary "x x x x 0" → fixed sync stages.
- cocotb testbench (test\_viterbi\_cocotb.py) with flattened-signal interface → automated random & edge-case tests.

Prompt: "Write a cocotb test to drive my flattened-array Viterbi top, compare against Python reference."

## 7. Steps to Run

• Icarus Verilog(iverilog) for simulation:

| iverilog -g2012 -o sim_viterbi.vvp viterbi_pe.v viterbi_top.v tb_viterbi.v |
|----------------------------------------------------------------------------|
| vvp sim_viterbi.vvp                                                        |
| gtkwave wave.vcd                                                           |

### Cocotb Simulation:

| python3 -m venv venv                       | # Activate Python environment |  |  |
|--------------------------------------------|-------------------------------|--|--|
| source venv/bin/activate                   |                               |  |  |
| pip install cocotb cocotb-bus numpy pytest |                               |  |  |
| make SIM=icarus                            | # Run cocotb tests            |  |  |

### Openlane flow:

sudo openlane --dockerized ./config.json

### 8. Hardware Flow

• Prepared OpenLane 2 config (config.json, constraints.sdc, pin\_order.cfg)

| Symptom                     | Root Cause                                   | Fix                                                                 |
|-----------------------------|----------------------------------------------|---------------------------------------------------------------------|
| Path outputs show x x x x 0 | delta_prev updated and sampled in same cycle | Introduced a FORWARD_WAIT state to register delta_next → delta_prev |

| cocotb install failing under system Python            | Externally-managed environment (PEP 668) | Moved to a virtualenv; updated Makefile to avoid system pip installs |
|-------------------------------------------------------|------------------------------------------|----------------------------------------------------------------------|
| GitHub push rejected (VCD >100 MB)                    | Large-file limit                         | Added *.vcd to .gitignore; removed from history via git rmcached     |
| Timeout waiting for completion in flattened top (vvp) | Missing initial-state write into path[0] | In IDLE→INIT, compute & store best initial state in path_flat[1:0]   |

## 9. Common Simulation Issues and GPT Prompts

- Invalid L-value errors in testbench arrays:
  - Prompt: "tb\_viterbi.sv errors: logC['sd0] not valid I-value" → fixed declaration to reg signed [W-1:0] logC [0:I-1];
- Port mismatch (log\_prob\_out not a port of uut):
  - Prompt: "tb\_viterbi.v:82 syntax error" → removed unused signal.
- Undefined outputs (path showing x x x 0):
  - Prompt: "check connections in viterbi\_top.sv" → added explicit two-cycle pipeline (CALC/UPDATE states), ensured delta next → delta prev handshake.
- Simulation hangs (timeout at 1 µs):
  - Prompt: "My viterbi\_top deadlocks in TESTBENCH" → removed obs\_ready, adjusted reset/start handshake, added timeout and fork/join in tb.

### 10. Functional Verification

- SystemVerilog testbench printed per-cycle FSM state and decoded path.
- Cocotb tests (test\_viterbi\_cocotb.py):
  - Prompt: "Integrate Cocotb with flattened logA\_flat signals" → wrote pack\_to\_flat\_signal, unpack in HDL, two test cases (basic, random, edge) fileciteturn1file1.
- pytest runner (test\_viterbi\_runner.py) configured lcarus with waveform generation

## 11. Synthesis & ASIC Flow

- OpenLane 2 config (config.json): enabled SV2V, set CLOCK\_PERIOD = 10 ns, PDN and P&R options fileciteturn1file3.
- Constraints: constraints.sdc, pin order.cfg provided I/O timing and placement guides.
- Modification: flattened all 2D arrays to 1D for P&R compatibility.

## 12. Next Steps & Remaining Work

- Complete OpenLane 2 placement and evaluate timing, area, power. (Currently stuck due to congestion of pins)
- Implement forward-log accelerator (log-sum-exp tree) alongside Viterbi PE array.
- Automate regression in GitHub Actions: nightly SW benchmarks vs. HW performance.

## 12. Topics Learnt and implemented

During this project, I learned these main topics:

- HW/SW Co-Design: partitioning algorithms between software and hardware
- SystemVerilog RTL: multi-dimensional arrays, generate loops, FSMs, pipelining
- Fixed-Point Arithmetic: log-domain representation, two's-complement packing
- **Performance Profiling**: Python's cProfile, line\_profiler, and py-spy for compute/memory hotspots
- Systolic Arrays: design of 1D arrays for ACS operations (max-plus DP)



### Systolic Arrays

- Verification: Icarus Verilog testbenches and cocotb-based Python verification
- ASIC Flow: Yosys synthesis, OpenLane 2 place-&-route, PDN and I/O constraints
- High-Level Synthesis (HLS): mapping Python-style loops to hardware pipelines
- Physical Design: power grid insertion, congestion analysis, constraints scripting
- Documentation & Portfolio: codefest reports, Heilmeier questions, GitHub portfolio curation