
## Intelligent Architectures (5LIL0) Assignment 3
 
<div style="display: flex; justify-content: center;">
    <figure style="background-color: lightgray; padding: 10px; display: inline-block; text-align: center;">
        <img src="./imgs/lab_3_overview.png" alt="Overview of Lab 3">
        <figcaption><strong>Figure 1:</strong> Overview of Lab 3.</figcaption>
    </figure>
</div>



### Simulating a digital hardware convolution accelerator

### Introduction:
 
The main objective of this lab is to **simulate the digital hardware implementation of a convolution accelerator**. To achieve this goal, you will learn how to transition from software simulations of simple neural networks to their implementations in custom digital hardware. Furthermore, you will be provided with an input-reuse dataflow accelerator implementation and will be asked to implement a weight-reuse version of it. These concepts will be treated in the lecture "DNN HW design principles & efficient dataflow", and you will need them for the second part of the assignment. Don't worry if they do not sound familiar yet; the lecture will cover them soon.

This lab is divided into two parts: **Lab 3A (Software Simulation)** and **Lab 3B (Hardware Simulation)**, as illustrated in the provided diagram. Each part focuses on specific aspects of transitioning a neural network from software to hardware.

The workflow above explains the process of transitioning neural networks from software simulations to hardware implementations, using Icarus Verilog as a tool for hardware simulation and PyTorch as a tool for exporting the network. Verilog is a hardware description language (HDL) commonly used to design and simulate digital systems, such as custom accelerators for neural networks. For this exercise we will use SystemVerilog instead of plain Verilog. SystemVerilog is an extension of Verilog that adds simplified data types, multi-dimensional arrays, and object-oriented programming. Icarus Verilog (Iverilog) is a widely used open-source simulator for Verilog/SystemVerilog that allows you to design digital hardware and verify its functionality and performance before physical implementation.
 

In **Lab 3A**, you will first load a pre-trained neural network (consisting of convolutional and fully connected layers) and evaluate its performance on the MNIST dataset in PyTorch at full precision (FP32) to establish a baseline. In hardware implementations, lower-precision formats, such as 4-bit or 8-bit fixed-point representations, are preferred over the standard FP32 used in software. Operating at lower precision significantly reduces memory consumption and computational complexity, enabling faster processing and lower energy consumption, all critical factors in hardware design. As the main goal is to design and test a hardware accelerator for the convolutional part of the network, you will quantize only the kernel values of the convolution to 4-bit fixed-point (INT4), while keeping the rest of the network in FP32. For the hardware simulation of the convolution operations, the quantized kernel values must be loaded into the multi-dimensional arrays defined in SystemVerilog. To do so, you will export the quantized parameters of the convolutional layer into memory initialization files (.mem) compatible with the hardware simulator. These files must follow a specific format, and you will be asked to write a conversion function that stores the quantized parameters in a form suitable for .mem files.


In **Lab 3B**, these exported memory initialization files will be loaded into the multi-dimensional array hosting the kernels of the convolution accelerator. You will then count the clock cycles required to perform the full convolution using the provided convolutional accelerator with input-reuse dataflow. After this, you will implement a weight-reuse dataflow and measure the clock cycles again to compare performance. To measure the number of clock cycles you will use GTKWave, a waveform visualization tool used to analyze signals in hardware simulations.

 
Through this workflow, you will gain hands-on experience with the practical considerations of moving from software to hardware, including the use of HDLs like Verilog, the benefits of quantization, and the trade-offs involved in custom digital accelerator design. Specifically, the boxes in blue in the image above must be implemented by you while the others will be provided. **Thus, in this lab you will learn how to:**
 
**1. Export kernel parameters into a file suitable for memory initialization of multi-dimensional arrays of hardware simulation.**

**2. Load these files into the simulated memory using SystemVerilog with the Icarus Verilog compiler.**

**3. Implement a weight-reuse (weight-stationary) dataflow convolutional accelerator.**


## Part A: Loading pre-trained model and exporting weights

The main steps of this section are:

1. Create a convolution + fully connected model class 
2. Load the pre-trained parameters from the .pth file 
3. Measure the test accuracy of the model using FP32 on the MNIST dataset
4. Quantize the weights to INT4 and measure the accuracy drop 
5. Export the quantized weights to a memory initialization file (.mem) for memory initialization in SystemVerilog

The blue steps in Figure 1 above indicate the parts you need to complete. 

## Steps 1 & 2 

In [4]:
# Ensure reproducibility across CO machines
import os 
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
os.environ["MKL_CBWR"] = "COMPATIBLE"

# Import required packages 
import numpy as np
import torch
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import copy
import matplotlib.pyplot as plt

np.random.seed(seed=0)
torch.manual_seed(0)

# Set the torch device to use cpu 
device = torch.device("cpu")

# Class definition of the model
output_chan = 3
class simple_CNN(nn.Module):
    def __init__(self):
        super(simple_CNN, self).__init__()
        self.conv = nn.Conv2d(in_channels=1, out_channels=output_chan, kernel_size=3, stride=1, padding=0, bias=False)
        self.relu = nn.ReLU()
        self.mlp = nn.Linear(26*26*output_chan, 10, bias=False)

        torch.manual_seed(42)
        nn.init.xavier_uniform_(self.conv.weight)
        nn.init.xavier_uniform_(self.mlp.weight)

    def forward(self, x):
        """Pass the input through the network."""
        # Convolution followed by ReLU
        x = self.conv(x)
        h1 = self.relu(x)
        # Flatten the feature maps to (batch, 26*26*output_chan) for the linear layer
        h1 = h1.view(h1.shape[0], -1)
        output = self.mlp(h1)
        return output


# Load pre-trained kernels into the model 
model = simple_CNN().to(device)
state_dict = torch.load('./simple_CNN.pth', weights_only=True)


# Load into the model
model.load_state_dict(state_dict)
model.eval().to(device)

simple_CNN(
  (conv): Conv2d(1, 3, kernel_size=(3, 3), stride=(1, 1), bias=False)
  (relu): ReLU()
  (mlp): Linear(in_features=2028, out_features=10, bias=False)
)

## Step 3

In [5]:
# Load the MNIST datasets
def binarize(image):
    return (image > 0.5).float()

transform = transforms.Compose([transforms.ToTensor(), transforms.Lambda(binarize)])
test_dataset = datasets.MNIST(root='./data', train=False, transform=transform, download=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Create a test function to measure accuracy 
def test(model, test_loader):
    model.eval()
    correct_cases = 0
    for batch_idx, (data, target) in enumerate(test_loader):
        output = model(data.to(device))
        target = target.to(device)
        correct_cases += (output.argmax(1) == target).sum().item()
    
    print(f'Accuracy {correct_cases/len(test_loader.dataset)}')

test(model, test_loader)
print(f"The correct Accuracy should be: 0.9648")

Accuracy 0.9648
The correct Accuracy should be: 0.9648


## Step 4

In the previous steps we created a model class and initialized the kernels with the pre-trained values of simple_CNN.pth. Now, the goal is to create a version of the model with quantized weights (post-training quantization) and measure the test accuracy. To achieve this goal we take the pre-trained kernels and quantize them to 4-bit integer precision. 

Use the following code to quantize the weights to the range [-8, 7] with 4-bit resolution.

In [6]:
def lower_precision(model, bits=8):
    model_copy = copy.deepcopy(model)
    for param in model_copy.parameters():
        param.data = (param.data * 2**(bits-1)).round().clamp(-2**(bits - 1), 2**(bits - 1) - 1)
    return model_copy

low_prec_model = lower_precision(model, bits=4)
low_prec_model.eval()
test(low_prec_model, test_loader)

Accuracy 0.9626
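
To make the arithmetic in `lower_precision` concrete, the sketch below applies the same three steps (scale by 2^(bits-1), round, clamp) to plain Python floats. The helper name `quantize_value` and the sample values are illustrative only and are not part of the lab code. Python's built-in `round` rounds halves to even, as `torch.round` does.

```python
def quantize_value(w, bits=4):
    """Scale, round, and clamp one float to a signed `bits`-bit integer."""
    scale = 2 ** (bits - 1)          # 8 for 4 bits
    lo, hi = -scale, scale - 1       # [-8, 7] for 4 bits
    q = round(w * scale)             # nearest integer after scaling
    return max(lo, min(hi, q))       # saturate values outside the range

# Illustrative values, not taken from the pre-trained model
samples = [0.3, -0.95, 1.2, -0.05, 0.5]
print([quantize_value(w) for w in samples])  # [2, -8, 7, 0, 4]
```

Note how 1.2 saturates at 7: any weight whose scaled value falls outside [-8, 7] is clamped, which is one source of the accuracy drop.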


## Step 5

Now, we need to export the quantized weights into a format suitable for memory initialization in our hardware simulation. In SystemVerilog, multi-dimensional arrays store these quantized weights and are initialized when an instance of the convolution accelerator is created.

To determine the correct format (**hexadecimal two's complement format**) and ordering of kernel values in the memory file (**KERNEL, ROW, COLUMN**), refer to the input-reuse module description ([`src/input_reuse.sv`](src/input_reuse.sv)). Since SystemVerilog expects values in either binary or hexadecimal format, your task in this section is to generate a [`kernel.mem`](.mem) file containing the quantized weights in the appropriate format.
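
The KERNEL, ROW, COLUMN ordering can be illustrated with a small pure-Python sketch. It assumes the file lists all values of kernel 0 row by row, then kernel 1, and so on, which is the order a row-major flatten of the `(3, 1, 3, 3)` weight tensor produces. The helper names `flat_index` and `flatten_kernels` are hypothetical.

```python
def flat_index(kernel, row, col, kernel_size=3):
    """Line number (0-based) of weight [kernel][row][col] in the memory file."""
    return (kernel * kernel_size + row) * kernel_size + col

def flatten_kernels(weights):
    """Flatten weights[kernel][row][col] in KERNEL, ROW, COLUMN order."""
    return [value for kernel in weights for row in kernel for value in row]

# Tag each entry with its own coordinates so the ordering is easy to read off
weights = [[[100 * k + 10 * r + c for c in range(3)] for r in range(3)]
           for k in range(3)]
flat = flatten_kernels(weights)

print(flat[:9])               # the nine values of kernel 0, row by row
print(flat_index(2, 1, 0))    # 21
print(flat[flat_index(2, 1, 0)])  # 210
```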


In [None]:
# (TODO: Remove in handout: only used to generate golden files)
# goldens = enumerate(test_loader)
# batch_idx, (golden_datas, golden_targets) = next(goldens)
# golden_data = golden_datas[0].unsqueeze(0)

# golden_target = golden_targets[0]
# golden_result = low_prec_model(golden_data.to("cpu"))


# TODO: FOLLOWING DONE BY STUDENTS 
golden_weights = low_prec_model.conv.weight.flatten().detach().numpy().astype(np.int32)

def memory_init_file_gen(weights, file_path, bits=4):

    # TODO: STUDENT PART {
    def to_hexa_complement(value, bits):
        """Converts an integer to its hexadecimal two's complement representation."""
        value = int(value)
        # Negative values wrap around modulo 2**bits (e.g. -4 -> 12 -> 'C' for 4 bits)
        if value < 0:
            value += (1 << bits)
        return hex(value)[2:].upper()

    # Despite the name, these are hexadecimal strings, one per weight
    binary_weights = [to_hexa_complement(w, bits) for w in weights]

    # Write one hexadecimal value per line, in the same order as `weights`
    with open(file_path, 'w') as f:
        for hex_weight in binary_weights:
            f.write(hex_weight + '\n')

    return binary_weights

binary_weights = memory_init_file_gen(golden_weights, './src/files/golden_kernel.mem', bits=4)

        
print(binary_weights[:9])
print("The first 9 values from a manual conversion are ['4', '4', '4', 'C', 'E', '3', 'C', 'C', 'C']")


['4', '4', '4', 'C', 'E', '3', 'C', 'C', 'C']
The first 9 values from a manual conversion are ['4', '4', '4', 'C', 'E', '3', 'C', 'C', 'C']
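
One way to sanity-check an exported file is to read it back and decode each hexadecimal digit into a signed integer. The sketch below is a minimal example under the assumption that the file holds exactly one value per line, with no comments or `@` address markers. The helper names and the temporary file are illustrative and not part of the lab code.

```python
import os
import tempfile

def from_hex_twos_complement(text, bits=4):
    """Decode a hexadecimal two's complement string into a signed integer."""
    value = int(text, 16)
    if value >= 1 << (bits - 1):     # sign bit set: value is negative
        value -= 1 << bits
    return value

def read_mem_file(path, bits=4):
    """Read one hexadecimal value per non-empty line and decode it."""
    with open(path) as f:
        return [from_hex_twos_complement(line.strip(), bits)
                for line in f if line.strip()]

# Round trip on the nine values printed above, using a throwaway file
hex_values = ['4', '4', '4', 'C', 'E', '3', 'C', 'C', 'C']
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.mem")
    with open(path, "w") as f:
        f.write("\n".join(hex_values) + "\n")
    decoded = read_mem_file(path)

print(decoded)  # [4, 4, 4, -4, -2, 3, -4, -4, -4]
```

If the decoded list equals the integer weights passed to the export function, the conversion and the ordering are both preserved.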


## Step 6

Now, to get familiar with hardware design, you are asked to write a module and a testbench in SystemVerilog.
The module takes a 96-bit string as input and searches for a match in the memory file. It returns the starting memory position of the match, or -1 if no match is found.

Also, write a testbench that tests the following string: "11111111 11111111 11111111 11101110 11111111 11111111 11111111 11110100 11111111 11111111 11111111 11111100". At what position is it located?
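
A software reference model can help cross-check the position reported by the hardware simulation. The sketch below is a minimal example assuming the memory has already been read into a list of unsigned 32-bit words, and that the most significant 32 bits of the search string correspond to the first memory word of the match. The function names and the toy memory contents are hypothetical.

```python
def split_words(bits96):
    """Split a 96-bit binary string into three 32-bit words, MSB word first."""
    bits = bits96.replace(" ", "").replace("_", "")
    if len(bits) != 96:
        raise ValueError("expected exactly 96 bits")
    return [int(bits[i:i + 32], 2) for i in range(0, 96, 32)]

def find_match(memory, pattern):
    """Return the first index where `pattern` occurs in `memory`, or -1."""
    n = len(pattern)
    for i in range(len(memory) - n + 1):
        if memory[i:i + n] == pattern:
            return i
    return -1

search = ("11111111 11111111 11111111 11101110 "
          "11111111 11111111 11111111 11110100 "
          "11111111 11111111 11111111 11111100")
pattern = split_words(search)
print([hex(w) for w in pattern])  # ['0xffffffee', '0xfffffff4', '0xfffffffc']

# Toy memory for illustration only, not the contents of the lab's .mem file
toy_memory = [0x0, 0xFFFFFFEE, 0xFFFFFFF4, 0xFFFFFFFC, 0x5]
print(find_match(toy_memory, pattern))  # 1
```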

**Note that you should connect to the coX machines and run the SystemVerilog modules and testbenches there! For this process, refer to the lecture slides or to Part B of this lab.**

**The following two cells are snippets of SystemVerilog code (for your reference)**

```systemverilog
// SYSTEM VERILOG MODULE TEMPLATE - for your reference and to be saved in string_match.sv
module string_match (
    input logic clk,
    input logic rst,
    input logic [95:0] search_string,  // 96-bit input string
    output logic found,                // 1 if found, 0 if not found
    output logic signed [31:0] position // Start position of match (-1 if not found)
);

    localparam MEMORY_SIZE = 2028;  // Example memory size (32-bit words)
    logic [31:0] memory [0:MEMORY_SIZE-1]; // Memory array (32-bit words)

    initial begin
        // Load data from file into memory
        $readmemh("golden_conv.mem", memory);
    end


    // TO DO COMPLETE
    logic [31:0] search_word0, search_word1, search_word2;
    assign search_word0 = search_string[31:0];
    assign search_word1 = search_string[63:32];
    assign search_word2 = search_string[95:64];

    always_ff @(posedge clk or posedge rst) begin
        if (rst) begin
            found <= 0;
            position <= -1;
        end else begin
            found <= 0;
            position <= -1;
            for (int i = 0; i < MEMORY_SIZE - 2; i++) begin
                if (memory[i] == search_word2 && memory[i+1] == search_word1 && memory[i+2] == search_word0) begin
                    found <= 1;
                    position <= i;
                    break;
                end
            end
        end
    // TO DO COMPLETE
    end
endmodule
```


```systemverilog
// SYSTEM VERILOG TESTBENCH TEMPLATE - for your reference and to be saved in string_match_tb.sv
module string_match_tb;

    logic clk;
    logic rst;
    logic [95:0] search_string;
    logic found;
    logic signed [31:0] position;

    // TO INSTANTIATE THE MODULE
    string_match uut (
        .clk(clk),
        .rst(rst),
        .search_string(search_string),
        .found(found),
        .position(position)
    );

    // Clock generation
    always #5 clk = ~clk;

    initial begin
        // -------------
        clk = 0;
        rst = 1;
        #10 rst = 0;
        // -------------
        
        // Test input string (concatenation of three 32-bit words)
        search_string = 96'b11111111111111111111111111101110_11111111111111111111111111110100_11111111111111111111111111111100;

        // TO DO COMPLETE 
        #20;
        
        // Display results
        if (found)
            $display("Match found at position: %d (in 32-bit words)", position);
        else
            $display("Match not found.");

        // End the simulation; the free-running clock would otherwise keep it alive
        $finish;
    end
endmodule
```

Expected output -> Match found at position: ??? (in 32-bit words)

***Please note that the testbench output of Part A could be a quiz question in Canvas***

## Part B: Hardware Accelerator Simulation 

In Part A of this lab, you quantized the convolutional layer's kernels to INT4 and exported them as a memory initialization file (.mem) for use in hardware simulation. Now, in Part B, you will focus on understanding the hardware simulation process. You will begin by exploring how hardware is described using SystemVerilog and how to compile it using Icarus Verilog (Iverilog). Then, you will simulate the hardware's behavior and analyze waveforms using GTKWave. Additionally, you will study different dataflow strategies, such as input-reuse and weight-reuse, as introduced in the accompanying slides. To make your life easier, an input-reuse module will be provided. Your task is to implement a weight-reuse version of this module, applying the concepts discussed in class.

Notably, this lab primarily involves working with the Linux shell rather than Python. This provides an excellent opportunity to improve your command-line skills, which are fundamental for any form of hardware design.

### Hardware Simulation Flow
Three tools are necessary for running hardware simulations: **Icarus Verilog** to compile the SystemVerilog descriptions of the hardware, **vvp** to run a simulation of the hardware, and **GTKWave** to visualize the waveforms. The standard flow to move from a hardware description (.sv) to simulation and waveform viewing is illustrated in Figure 2 below.

As you can see, the Icarus Verilog compiler takes two inputs: 

- **testbench.sv** -> provides an environment to simulate and verify the functionality of module.sv (also referred to as the Device Under Test or DUT) by providing inputs (such as memory initialization files) and monitoring its outputs.
- **module.sv** -> the file defining the actual hardware module (e.g. a convolution accelerator). 

<div style="background-color: lightgray; text-align: center; padding: 10px;">
    <figure style="margin: 0;">
        <img src="imgs/hw_toolchain.png" alt="Image description" style="display: inline-block;">
        <figcaption style="color: black;"><strong>Figure 2: Toolchain</strong> </figcaption>
    </figure>
</div>
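
The compile and simulate steps of this flow can also be scripted. The sketch below is a minimal example that only wraps the two command lines given in the run procedure later in this lab (`iverilog -g2012 -o out ...` and `vvp out`). The function names are hypothetical, and the script assumes `iverilog` and `vvp` are on the `PATH` of the machine you run it on.

```python
import shutil
import subprocess

def build_commands(testbench, module, out="out"):
    """Return the compile and simulate command lines as argument lists."""
    compile_cmd = ["iverilog", "-g2012", "-o", out, testbench, module]
    simulate_cmd = ["vvp", out]
    return compile_cmd, simulate_cmd

def run_flow(testbench, module, out="out"):
    """Compile, then simulate; stop at the first command that fails."""
    if shutil.which("iverilog") is None or shutil.which("vvp") is None:
        raise RuntimeError("iverilog and vvp must be available on PATH")
    for cmd in build_commands(testbench, module, out):
        subprocess.run(cmd, check=True)

# Print the commands without running them
compile_cmd, simulate_cmd = build_commands("./src/testbench.sv",
                                           "./src/input_reuse.sv")
print(" ".join(compile_cmd))
print(" ".join(simulate_cmd))
```

Calling `run_flow("./src/testbench.sv", "./src/input_reuse.sv")` on a machine with the tools installed would run both steps in order.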

### Directory organization

Inside the **/src** directory you will find the following files: 

- **input_reuse.sv** -> module definition of the convolution accelerator using input reuse 
- **testbench.sv** -> testbench for the module 


Inside the **/src/files** directory you can find the following: 

- **golden_output.mem** -> memory initialization file to verify matching of the hardware convolution with the software 
- **golden_input.mem** -> memory initialization file for the image
- **golden_kernel.mem** -> memory initialization file for the kernel (generated in Part A)

Note that there is no need for you to modify any of these memory initialization files as they are only used to ensure you have a match between the software simulation and the hardware simulation. 


### Input-reuse CNN accelerator example

 
<div style="display: flex; justify-content: center;">
    <figure style="background-color: lightgray; padding: 10px; display: inline-block; text-align: center;">
        <img src="./imgs/input_reuse.png" alt="Overview of Lab 3">
        <figcaption style="color: black;"><strong>Figure 3: Input-reuse</strong> </figcaption>
    </figure>
</div>


In this section you will learn how to run the example for the input-reuse dataflow accelerator depicted in Figure 3, where red represents memory modules, yellow denotes buffers, and blue denotes compute units. As you can see, for input-reuse, a window of the input image is loaded into the buffer from the main memory, while the kernels are streamed to the convolution compute unit. You should investigate **testbench.sv** to understand how memory initialization files are loaded and how we interact with the convolutional module defined in **input_reuse.sv**. 

The procedure to run a full simulation and open GTKWave is as follows: 

1. Log in to the CO machines, ensuring graphical forwarding is enabled as explained in the lecture.
2. Compile both the module and the testbench by running `iverilog -g2012 -o out ./src/testbench.sv ./src/input_reuse.sv`. The output of this command is the `out` file, which is used for simulation in the next step. 
3. Simulate the testbench by running `vvp out`. This command runs the simulation and creates a `waveform.vcd` file, which can be used to visualize the waveforms. 
4. To observe the simulation waveforms, run `gtkwave waveform.vcd`.
5. In GTKWave, learn how to use cursors to measure time differences (there will be a question about this in the quiz).
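
As a cross-check on cursor measurements, clock cycles can also be counted directly from the VCD text. The sketch below is a minimal example that counts rising edges of a 1-bit signal inside a time window. It assumes the clock is dumped under the name `clk` (an assumption about the testbench, not something confirmed here) and uses the first signal with that name if several scopes contain one. The embedded VCD snippet is made up for illustration.

```python
def count_rising_edges(vcd_text, signal="clk", t_start=0, t_end=float("inf")):
    """Count 0->1 transitions of a 1-bit signal with t_start <= time <= t_end."""
    ident = None   # short identifier code the VCD assigns to the signal
    now = 0        # current simulation time while scanning value changes
    prev = None    # previous value of the signal
    edges = 0
    for line in vcd_text.splitlines():
        tok = line.split()
        if not tok:
            continue
        if tok[0] == "$var":
            # Declaration: $var <type> <width> <ident> <name> $end
            if ident is None and len(tok) >= 5 and tok[4] == signal:
                ident = tok[3]
        elif tok[0].startswith("#") and tok[0][1:].isdigit():
            now = int(tok[0][1:])                # timestamp line, e.g. "#15"
        elif (ident is not None and len(tok) == 1
              and tok[0][0] in "01xz" and tok[0][1:] == ident):
            value = tok[0][0]                    # scalar change, e.g. "1!"
            if value == "1" and prev == "0" and t_start <= now <= t_end:
                edges += 1
            prev = value
    return edges

# Made-up VCD fragment: clock with period 10, rising at t = 5, 15, 25
example_vcd = """$timescale 1ns $end
$scope module tb $end
$var wire 1 ! clk $end
$var wire 1 % rst $end
$upscope $end
$enddefinitions $end
#0
0!
1%
#5
1!
#10
0!
0%
#15
1!
#20
0!
#25
1!
"""
print(count_rising_edges(example_vcd))                         # 3
print(count_rising_edges(example_vcd, t_start=10, t_end=30))   # 2
```

For a real run, read the file with `open("waveform.vcd").read()` and pass the start and end times you identified in GTKWave.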

Notably, the simulation prints whether the convolution results obtained in software (stored in golden_conv.mem) match the output of the convolution in the hardware simulation.

### Weight-reuse CNN accelerator example

<div style="display: flex; justify-content: center;">
    <figure style="background-color: lightgray; padding: 10px; display: inline-block; text-align: center;">
        <img src="./imgs/weight_reuse.png" alt="Overview of Lab 3">
        <figcaption style="color: black;"><strong>Figure 4: Weight-reuse</strong> </figcaption>
    </figure>
</div>


In the previous section, you explored the hardware simulation flow and gained hands-on experience with an input-reuse dataflow accelerator. Now, it's time for the most exciting part of this lab: designing your own convolution accelerator with weight-reuse. As depicted in Figure 4, for weight-reuse, the kernels must be loaded from memory into the buffer while the image is streamed directly from memory to the convolution processing unit. 

To accomplish this, you will need to:

1. Create a new file named **weight_reuse.sv**, where you will define your weight-reuse accelerator.
2. Modify **testbench.sv** to integrate and test your newly implemented module.
3. Ensure that all convolution results match.
4. Upload **weight_reuse.sv** to the Canvas assignment page. 
   
This step will allow you to compare different dataflow strategies and gain a deeper understanding of hardware-efficient convolution operations. 
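
The difference between the two dataflows can be sketched in software as a change of loop order. The example below is a simplified Python model, not the provided hardware: in the input-reuse version each image window is fetched once and reused for every kernel, while in the weight-reuse version each kernel is held fixed and the image is streamed past it. Both produce identical results. The function names and the toy 4x4 image are illustrative.

```python
def conv_input_reuse(image, kernels):
    """Window in the outer loops: buffer one window, stream all kernels."""
    k = len(kernels[0])
    out_h, out_w = len(image) - k + 1, len(image[0]) - k + 1
    out = [[[0] * out_w for _ in range(out_h)] for _ in kernels]
    for r in range(out_h):
        for c in range(out_w):
            window = [row[c:c + k] for row in image[r:r + k]]  # buffered once
            for n, kernel in enumerate(kernels):               # kernels streamed
                out[n][r][c] = sum(window[i][j] * kernel[i][j]
                                   for i in range(k) for j in range(k))
    return out

def conv_weight_reuse(image, kernels):
    """Kernel in the outer loop: buffer one kernel, stream the whole image."""
    k = len(kernels[0])
    out_h, out_w = len(image) - k + 1, len(image[0]) - k + 1
    out = [[[0] * out_w for _ in range(out_h)] for _ in kernels]
    for n, kernel in enumerate(kernels):                       # buffered once
        for r in range(out_h):
            for c in range(out_w):                             # image streamed
                out[n][r][c] = sum(image[r + i][c + j] * kernel[i][j]
                                   for i in range(k) for j in range(k))
    return out

# Toy 4x4 image with values 0..15 and two 3x3 kernels
image = [[4 * r + c for c in range(4)] for r in range(4)]
kernels = [
    [[1, 0, 0], [0, 0, 0], [0, 0, 0]],   # picks the top-left pixel of each window
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],   # sums each window
]
a = conv_input_reuse(image, kernels)
b = conv_weight_reuse(image, kernels)
print(a == b)   # True
print(a[1])     # [[45, 54], [81, 90]]
```

In hardware the two orders differ in which operand sits in the buffer and which is re-read from memory, which is what changes the cycle count you measure.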
