# Part 1: Systolic Array Accelerator Demo

This notebook demonstrates the Part 1 systolic array convolution accelerator running on the RFSoC 4x2 FPGA.

## Overview
- Load the FPGA bitstream with the systolic array accelerator
- Write activation data to the accelerator
- Run the convolution computation (9 kernel positions)
- Read back results and verify against expected output

## Register Map
| Offset | Name | R/W | Description |
|--------|------|-----|-------------|
| 0x000 | CTRL | R/W | [0]=start, [1]=done(ro), [2]=idle(ro) |
| 0x004 | STATUS | R | [7:0]=state, [11:8]=kij_counter |
| 0x008-0x094 | ACT_IN[0-35] | W | 36 × 32-bit activation inputs |
| 0x100-0x1FC | OUT[0-15] | R | 16 × 128-bit output values |
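
The offsets in the table follow a simple pattern: one 32-bit word per activation, and four consecutive 32-bit words per 128-bit output. A minimal sketch of that address arithmetic is shown below. The helper names `act_offset` and `out_offset` are hypothetical, and the word order (word 0 holds the least significant 32 bits) is taken from the `read_outputs` helper later in this notebook.

```python
# Base offsets copied from the register map; helper names are illustrative only.
ADDR_ACT_BASE = 0x008
ADDR_OUT_BASE = 0x100

def act_offset(i):
    """Byte offset of ACT_IN[i] (one 32-bit word per activation)."""
    if not 0 <= i < 36:
        raise IndexError(f"activation index {i} out of range")
    return ADDR_ACT_BASE + 4 * i

def out_offset(i, word):
    """Byte offset of 32-bit word `word` (0 = least significant) of OUT[i]."""
    if not (0 <= i < 16 and 0 <= word < 4):
        raise IndexError(f"output index ({i}, {word}) out of range")
    return ADDR_OUT_BASE + 16 * i + 4 * word

print(hex(act_offset(35)))     # 0x94, the last activation register
print(hex(out_offset(15, 3)))  # 0x1fc, the last output word
```

Both results match the upper bounds listed in the table (0x094 and 0x1FC).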

In [1]:
from pynq import Overlay
import numpy as np
import time

## 1. Load the Overlay

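Load the bitstream onto the FPGA. PYNQ reads the hardware description from a `.hwh` file with the same base name as the bitstream, so `part1_pynq.bit` and `part1_pynq.hwh` should both be in the same directory as this notebook.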

In [2]:
# Load the overlay
ol = Overlay("part1_pynq.bit")
print("Overlay loaded successfully!")
print(f"IP blocks: {ol.ip_dict.keys()}")

Overlay loaded successfully!
IP blocks: dict_keys(['systolic_array_wrapper_0', 'zynq_ultra_ps_e_0'])


In [3]:
# Get reference to the systolic array IP
ip = ol.systolic_array_wrapper_0
print(f"IP address: 0x{ip.mmio.base_addr:08X}")
print(f"IP range: {ip.mmio.length} bytes")

IP address: 0x80000000
IP range: 65536 bytes


## 2. Define Helper Functions

In [4]:
# Register offsets
ADDR_CTRL = 0x000
ADDR_STATUS = 0x004
ADDR_ACT_BASE = 0x008
ADDR_OUT_BASE = 0x100

# Constants
NUM_ACTIVATIONS = 36
NUM_OUTPUTS = 16
COL = 8
PSUM_BW = 16

def read_ctrl():
    """Read control register and decode status bits."""
    val = ip.read(ADDR_CTRL)
    return {
        'start': bool(val & 0x1),
        'done': bool(val & 0x2),
        'idle': bool(val & 0x4)
    }

def read_status():
    """Read status register."""
    val = ip.read(ADDR_STATUS)
    return {
        'state': val & 0xFF,
        'kij': (val >> 8) & 0xF
    }

def start_computation():
    """Start the computation by setting the start bit."""
    ip.write(ADDR_CTRL, 0x1)

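def wait_for_done(timeout_s=5.0):
    """Poll CTRL until the done bit is set. Return True on success, False on timeout."""
    # A monotonic clock is unaffected by wall-clock adjustments during the wait
    deadline = time.monotonic() + timeout_s
    while True:
        if read_ctrl()['done']:
            return True
        if time.monotonic() > deadline:
            print(f"Timeout! Status: {read_status()}")
            return False
        time.sleep(0.001)  # poll roughly once per millisecond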

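def write_activations(activations):
    """Write activation data (list of 36 x 32-bit values)."""
    # Guard against writing past ACT_IN[35] into unrelated registers
    if len(activations) != NUM_ACTIVATIONS:
        raise ValueError(f"Expected {NUM_ACTIVATIONS} activations, got {len(activations)}")
    for i, act in enumerate(activations):
        ip.write(ADDR_ACT_BASE + i * 4, int(act))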

def read_outputs():
    """Read all 16 output values (128-bit each)."""
    outputs = []
    for i in range(NUM_OUTPUTS):
        # Each 128-bit output spans 4 x 32-bit registers
        val = 0
        for j in range(4):
            word = ip.read(ADDR_OUT_BASE + i * 16 + j * 4)
            val |= word << (j * 32)
        outputs.append(val)
    return outputs

def decode_output(val_128bit):
    """Decode 128-bit output into 8 x 16-bit signed values."""
    result = []
    for i in range(COL):
        val_16 = (val_128bit >> (i * PSUM_BW)) & 0xFFFF
        # Convert to signed
        if val_16 >= 0x8000:
            val_16 = val_16 - 0x10000
        result.append(val_16)
    return result
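
To sanity-check `decode_output` without hardware, it can help to build a 128-bit word from known channel values and confirm that decoding recovers them. The sketch below does that. `pack_output` is a hypothetical helper that is not part of the notebook, and `decode_output` is repeated here only so the example runs on its own.

```python
COL = 8
PSUM_BW = 16

def pack_output(channels):
    """Pack 8 signed 16-bit values into one 128-bit integer (channel 0 in the low bits)."""
    if len(channels) != COL:
        raise ValueError(f"expected {COL} channels, got {len(channels)}")
    word = 0
    for i, v in enumerate(channels):
        if not -(1 << (PSUM_BW - 1)) <= v < (1 << (PSUM_BW - 1)):
            raise ValueError(f"channel {i} value {v} does not fit in {PSUM_BW} bits")
        # Mask to 16 bits so negative values use two's-complement encoding
        word |= (v & 0xFFFF) << (i * PSUM_BW)
    return word

def decode_output(val_128bit):
    """Same logic as the notebook helper: 8 x 16-bit signed values."""
    result = []
    for i in range(COL):
        val_16 = (val_128bit >> (i * PSUM_BW)) & 0xFFFF
        if val_16 >= 0x8000:
            val_16 -= 0x10000
        result.append(val_16)
    return result

channels = [9, 4, 0, 37, 38, 0, -1, -32768]
assert decode_output(pack_output(channels)) == channels
```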

## 3. Load Activation Data

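Load the activation data and the expected outputs from the test files. Each file starts with three comment lines, followed by one binary word per line: 32 bits per line for activations and 128 bits per line for outputs.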

In [5]:
def load_activation_file(filepath):
    """Load activation data from file (skips first 3 comment lines)."""
    activations = []
    with open(filepath, 'r') as f:
        lines = f.readlines()
    
    # Skip first 3 comment lines
    for line in lines[3:]:
        line = line.strip()
        if line and len(line) == 32:
            activations.append(int(line, 2))
    
    return activations

def load_output_file(filepath):
    """Load expected output data from file (skips first 3 comment lines)."""
    outputs = []
    with open(filepath, 'r') as f:
        lines = f.readlines()
    
    # Skip first 3 comment lines
    for line in lines[3:]:
        line = line.strip()
        if line and len(line) == 128:
            outputs.append(int(line, 2))
    
    return outputs
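
Both loaders above silently skip any line whose length is not exactly 32 or 128 characters, so a malformed file simply yields fewer values. A stricter variant that reports the offending line might look like the sketch below. `parse_binary_words` is a hypothetical helper, and the `#` comment prefix in the sample data is an assumption, since the notebook only says the first three lines are comments.

```python
import io

def parse_binary_words(lines, width, header_lines=3):
    """Parse one binary word per line, skipping the header and blank lines.

    Raises ValueError on any line that is not exactly `width` binary digits.
    """
    words = []
    for lineno, line in enumerate(lines, start=1):
        if lineno <= header_lines:
            continue
        text = line.strip()
        if not text:
            continue
        if len(text) != width or set(text) - {"0", "1"}:
            raise ValueError(f"line {lineno}: expected {width} binary digits, got {text!r}")
        words.append(int(text, 2))
    return words

# In-memory stand-in for a small activation file (format assumed)
sample = io.StringIO("# c1\n# c2\n# c3\n" + "0" * 28 + "0101\n" + "1" * 32 + "\n")
print(parse_binary_words(sample, width=32))  # [5, 4294967295]
```

An open file object can be passed in place of `sample`, since both iterate line by line.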

In [6]:
# Load test data
# Note: Adjust path as needed based on where files are located on the PYNQ board
activation_file = "activation.txt"  # Copy from software/part1/
output_file = "output.txt"  # Copy from software/part1/

try:
    activations = load_activation_file(activation_file)
    print(f"Loaded {len(activations)} activation values")
    expected_outputs = load_output_file(output_file)
    print(f"Loaded {len(expected_outputs)} expected output values")
except FileNotFoundError as e:
    print(f"Error loading files: {e}")
    print("\nPlease copy activation.txt and output.txt from software/part1/ directory")
    activations = None
    expected_outputs = None

Loaded 36 activation values
Loaded 16 expected output values


## 4. Check Initial State

In [7]:
# Check initial state
print("Control register:", read_ctrl())
print("Status register:", read_status())

Control register: {'start': False, 'done': False, 'idle': True}
Status register: {'state': 0, 'kij': 0}


## 5. Run Computation

In [8]:
if activations is not None:
    # Write activation data
    print("Writing activation data...")
    write_activations(activations)
    print(f"  Wrote {len(activations)} values")
    
    # Start computation
    print("\nStarting computation...")
    start_time = time.time()
    start_computation()
    
    # Wait for completion
    if wait_for_done(timeout_s=10.0):
        elapsed = time.time() - start_time
        print(f"Computation completed in {elapsed*1000:.2f} ms")
    else:
        print("Computation did not complete!")
        print(f"  Control: {read_ctrl()}")
        print(f"  Status: {read_status()}")

Writing activation data...
  Wrote 36 values

Starting computation...
Timeout! Status: {'state': 0, 'kij': 0}
Computation did not complete!
  Control: {'start': False, 'done': False, 'idle': True}
  Status: {'state': 0, 'kij': 0}
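
In the run above the poll timed out while the accelerator reported idle, and the notebook does not explain why. One way to investigate such a case is to record every distinct value seen in the control register while polling. The sketch below shows the idea with a fake register reader so it runs without hardware. `poll_until` is a hypothetical helper, and nothing here is a claim about how the accelerator's done bit actually behaves.

```python
import time

def poll_until(read_fn, predicate, timeout_s=5.0, interval_s=0.001):
    """Call read_fn() until predicate(value) is true or the timeout expires.

    Returns (ok, seen), where seen lists each change in the value read.
    """
    deadline = time.monotonic() + timeout_s
    seen = []
    while True:
        value = read_fn()
        if not seen or seen[-1] != value:
            seen.append(value)  # record only transitions, not every poll
        if predicate(value):
            return True, seen
        if time.monotonic() >= deadline:
            return False, seen
        time.sleep(interval_s)

# Fake sequence of CTRL values standing in for ip.read(ADDR_CTRL)
fake_values = iter([0x1, 0x0, 0x0, 0x2])
ok, seen = poll_until(lambda: next(fake_values), lambda v: bool(v & 0x2), timeout_s=1.0)
print(ok, [hex(v) for v in seen])  # True ['0x1', '0x0', '0x2']
```

On the board, `read_fn` could be `lambda: ip.read(ADDR_CTRL)`, using the same `ip.read` call as the helpers in section 2.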


## 6. Read and Verify Results

In [9]:
if activations is not None:
    # Read outputs
    print("Reading outputs...")
    hw_outputs = read_outputs()
    
    # Verify against expected
    print("\n" + "="*60)
    print("Verification Results")
    print("="*60)
    
    errors = 0
    for i in range(NUM_OUTPUTS):
        hw_val = hw_outputs[i]
        # Outputs missing from the expected file are compared against 0
        exp_val = expected_outputs[i] if i < len(expected_outputs) else 0
        
        if hw_val == exp_val:
            print(f"Output {i:2d}: PASS ✓")
        else:
            errors += 1
            print(f"Output {i:2d}: FAIL ✗")
            # Show raw and decoded values only for mismatches
            print(f"  HW:  {hw_val:032X}")
            print(f"  Exp: {exp_val:032X}")
            print(f"  Decoded: {decode_output(hw_val)}")
    
    print("\n" + "="*60)
    if errors == 0:
        print("All outputs match!")
    else:
        print(f"{errors}/{NUM_OUTPUTS} outputs have errors")
    print("="*60)

Reading outputs...

Verification Results
Output  0: PASS ✓
Output  1: PASS ✓
Output  2: PASS ✓
Output  3: PASS ✓
Output  4: PASS ✓
Output  5: PASS ✓
Output  6: PASS ✓
Output  7: PASS ✓
Output  8: PASS ✓
Output  9: PASS ✓
Output 10: PASS ✓
Output 11: PASS ✓
Output 12: PASS ✓
Output 13: PASS ✓
Output 14: PASS ✓
Output 15: PASS ✓

All outputs match!


## 7. Display Decoded Outputs

In [10]:
if activations is not None:
    print("\nDecoded Output Values (16 x 8 channels):")
    print("-" * 80)
    
    # Create a numpy array for better visualization
    output_matrix = np.zeros((NUM_OUTPUTS, COL), dtype=np.int16)
    
    for i, hw_val in enumerate(hw_outputs):
        decoded = decode_output(hw_val)
        output_matrix[i, :] = decoded
    
    # Display as table
    print("      ", end="")
    for c in range(COL):
        print(f"  Ch{c:d}  ", end="")
    print()
    
    for i in range(NUM_OUTPUTS):
        print(f"Out{i:2d}:", end="")
        for c in range(COL):
            print(f" {output_matrix[i,c]:6d}", end="")
        print()


Decoded Output Values (16 x 8 channels):
--------------------------------------------------------------------------------
        Ch0    Ch1    Ch2    Ch3    Ch4    Ch5    Ch6    Ch7  
Out 0:      9      4      0     37     38      0      0      0
Out 1:      0      0      0     67     14      0      0      0
Out 2:     36      0      0     55      8      0      0      0
Out 3:     62      0     19     26      0     12      0      0
Out 4:     60      0      0      0     81     50      0     17
Out 5:     89      0      0      0     83     44      0      5
Out 6:     84      0     31      0     70      0      0     10
Out 7:     73      0     95     16      0      0      0      0
Out 8:    105      0      0     13    145      4      0     17
Out 9:    150      0      0     53    208      0      0     39
Out10:     61     21     23    106    180      0      0     55
Out11:      0      0     82     86     45      0      0     24
Out12:      0     14      0     30    124      0      0   

## 8. Performance Summary

In [11]:
print("Performance Summary")
print("="*60)
print("Systolic Array Size: 8x8")
print("Bit Width: 4-bit (weights/activations), 16-bit (accumulators)")
print("Kernel Positions: 9 (3x3 convolution)")
print("Input Activations: 36 (6x6 padded)")
print("Output Feature Map: 16 (4x4)")
print("Output Channels: 8")
print(f"Total MAC Operations: {9 * 36 * 8 * 8} per inference")
print("Clock Frequency: 100 MHz (default)")

Performance Summary
Systolic Array Size: 8x8
Bit Width: 4-bit (weights/activations), 16-bit (accumulators)
Kernel Positions: 9 (3x3 convolution)
Input Activations: 36 (6x6 padded)
Output Feature Map: 16 (4x4)
Output Channels: 8
Total MAC Operations: 20736 per inference
Clock Frequency: 100 MHz (default)
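
The total of 20,736 counts all 36 padded input positions at each of the 9 kernel positions. The sketch below separates that figure from the number of MACs that land in the 4x4 output map, and derives an idealized cycle count. It assumes the 8x8 array corresponds to 8 input channels by 8 output channels and that all 64 processing elements perform one MAC per cycle with no load or drain overhead. These are assumptions for illustration, not measured figures.

```python
KERNEL_POSITIONS = 9    # 3x3 convolution
INPUT_POSITIONS = 36    # 6x6 padded input
OUTPUT_POSITIONS = 16   # 4x4 output feature map
ROWS, COLS = 8, 8       # systolic array size

# MACs issued when every input position is streamed for every kernel position
macs_streamed = KERNEL_POSITIONS * INPUT_POSITIONS * ROWS * COLS

# MACs that contribute to one of the 16 output positions
macs_in_output = KERNEL_POSITIONS * OUTPUT_POSITIONS * ROWS * COLS

# Idealized lower bound: all 64 PEs busy every cycle at 100 MHz
ideal_cycles = macs_streamed // (ROWS * COLS)
ideal_time_us = ideal_cycles / 100e6 * 1e6

print(macs_streamed)             # 20736
print(macs_in_output)            # 9216
print(ideal_cycles)              # 324
print(f"{ideal_time_us:.2f} us") # 3.24 us
```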


## Notes

- The weights are pre-loaded in BRAM at synthesis time
- To change weights, the bitstream must be regenerated
- The accelerator computes all 9 kernel positions and accumulates them internally
- ReLU is applied after accumulation
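
The last two notes can be read as a small software reference model: sum the partial sums from all kernel positions for each output position and channel, then clamp negatives to zero. A minimal pure-Python sketch follows. The `psums[k][n][c]` layout and the function name are assumptions made for illustration, and 16-bit overflow in the accumulators is not modeled.

```python
def accumulate_and_relu(psums):
    """Sum partial sums over kernel positions, then apply ReLU.

    psums[k][n][c]: partial sum for kernel position k, output position n, channel c.
    Returns out[n][c] = max(0, sum over k of psums[k][n][c]).
    """
    n_out = len(psums[0])
    n_ch = len(psums[0][0])
    return [
        [max(0, sum(p[n][c] for p in psums)) for c in range(n_ch)]
        for n in range(n_out)
    ]

# Tiny example: 2 kernel positions, 2 output positions, 2 channels
psums = [
    [[3, -5], [1, 1]],
    [[4, 2], [-7, 0]],
]
print(accumulate_and_relu(psums))  # [[7, 0], [0, 1]]
```

Applying ReLU once after the sum, rather than per kernel position, matters: in the example the first channel-1 entry sums to -3 and is clamped to 0, whereas clamping each partial sum first would give 2.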