# How to generate an FPGA accelerator for a quantized BERT model

## Overview

In this tutorial, we'll see how to load a BERT model from the Mase model library, optimize it by quantizing the weights, then emit the SystemVerilog code for a custom dataflow accelerator, ready to be deployed on an Intel or Xilinx FPGA. This involves generating a computation graph for the model, then invoking several Mase compiler passes. First, we go through this in detail, discussing each step. Then, we show how to use the `chop.pipelines` pass managers to encapsulate all this functionality within a single function call. Finally, we'll run the generated [Cocotb](https://www.cocotb.org/) testbench to evaluate the throughput and latency of the emitted accelerator.

This tutorial assumes you have a working Mase environment. Follow the instructions [here](https://deepwok.github.io/mase/modules/documentation/getting_started.html) to get started using Conda or Docker. You will also need a working Questa installation to run the testbench of the accelerator. If you don't have Questa available, you can also use Verilator; however, the simulation runtime may be very long.

In [None]:
# Set up a logger
from chop.tools import get_logger

logger = get_logger(__name__)
logger.setLevel("INFO")

## Import and quantize the model

First, let's import the BERT model from Mase's [patched model library](https://github.com/DeepWok/mase/tree/main/src/chop/models). We'll define a small configuration with 3 layers, a hidden size of 96 and an intermediate size of 384. We'll also define a quantization configuration which specifies the fixed-point precision we want to run the model with.

In [None]:
from chop.models.patched.bert import BertConfig, BertModel

config = BertConfig()
config.num_hidden_layers = 3
config.hidden_size = 96
config.intermediate_size = 384

q_config = {
    "data_in_width": 8,
    "data_in_frac_width": 3,
    "weight_width": 8,
    "weight_frac_width": 3,
    "bias_width": 8,
    "bias_frac_width": 3,
    "data_out_width": 8,
    "data_out_frac_width": 3,
}

model = BertModel(config)
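
To get a feel for what a precision such as 8 bits with 3 fractional bits means, here is a minimal sketch of signed fixed-point quantization in plain Python. It reads `width` as the total bit count and `frac_width` as the number of fractional bits, and it assumes round-to-nearest with saturation; the helper name `quantize_fixed` is hypothetical, and the rounding behaviour of the Mase quantizers may differ.

```python
import math


def quantize_fixed(x, width, frac_width):
    """Quantize x to a signed fixed-point grid (illustrative, not Mase's code)."""
    scale = 2**frac_width
    # Representable integer range for a signed two's-complement word
    q_min = -(2 ** (width - 1))
    q_max = 2 ** (width - 1) - 1
    # Round half up (an assumption), then saturate to the representable range
    q = math.floor(x * scale + 0.5)
    q = max(q_min, min(q_max, q))
    return q / scale


# With width=8 and frac_width=3 the step is 1/8 and the range is [-16, 15.875]
print(quantize_fixed(1.3, 8, 3))
print(quantize_fixed(100.0, 8, 3))
print(quantize_fixed(-100.0, 8, 3))
```

Under these assumptions, values are snapped to multiples of 0.125, and anything outside the representable range is clamped, which is why a small `frac_width` trades resolution for dynamic range.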

Now that the model is defined, we are ready to quantize it by writing a module-level pass. This simply iterates through the modules in the PyTorch model and replaces the relevant ones with their quantized equivalents. In the BERT model, the relevant modules that need to be quantized are:
1. The self attention layer
2. Linear layers
3. Layer normalization layer
4. GELU activation layer

You can see that Mase has a library of quantized neural network layers under the `chop.nn.quantized` API. See [here](https://github.com/DeepWok/mase/tree/main/src/chop/nn) for a full reference of the available modules.

In [None]:
import torch.nn as nn
from transformers.activations import GELUActivation
from chop.models.patched.bert.modeling_bert import BertSelfAttention
from chop.nn.quantized import (
    BertSelfAttentionInteger,
    LinearInteger,
    LayerNormInteger,
    GELUInteger,
)
from chop.passes.graph.utils import deepsetattr


def bert_module_level_quantize(model, model_config, q_config):
    # Walk every submodule and swap supported layers for their integer versions
    for name, module in model.named_modules():
        if isinstance(module, BertSelfAttention):
            new_module = BertSelfAttentionInteger(
                model_config, q_config, output_tensor_only=True
            )
        elif isinstance(module, nn.Linear):
            new_module = LinearInteger(
                in_features=module.in_features,
                out_features=module.out_features,
                bias=module.bias is not None,
                config=q_config,
            )
        elif isinstance(module, nn.LayerNorm):
            new_module = LayerNormInteger(
                normalized_shape=module.normalized_shape,
                eps=module.eps,
                config=q_config,
            )
        elif isinstance(module, GELUActivation):
            new_module = GELUInteger(config=q_config)
        else:
            continue
        logger.info(f"Replacing module: {name}")
        deepsetattr(model, name, new_module)
    return model


model = bert_module_level_quantize(model, config, q_config)
logger.info(f"Quantized BERT model: {model}")
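
The replacement step relies on setting an attribute that is addressed by a dotted name such as the ones returned by `named_modules()`. The following is a minimal sketch of that idea on plain Python objects. The helper `set_by_dotted_path` and the toy object are hypothetical stand-ins for illustration only; this is not the implementation of `deepsetattr` in Mase.

```python
from functools import reduce
from types import SimpleNamespace


def set_by_dotted_path(root, path, value):
    """Set root.a.b.c = value given the string 'a.b.c' (illustrative helper)."""
    *parents, leaf = path.split(".")
    # Walk down to the object that owns the final attribute
    owner = reduce(getattr, parents, root)
    setattr(owner, leaf, value)


# Toy hierarchy standing in for a nested model
toy = SimpleNamespace(
    encoder=SimpleNamespace(output=SimpleNamespace(dense="float_linear"))
)
set_by_dotted_path(toy, "encoder.output.dense", "integer_linear")
print(toy.encoder.output.dense)
```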

## Emit SystemVerilog code for the accelerator: step-by-step

Now that the model is quantized, we are ready to run Mase's FX compiler flow. This involves extracting a computation graph from the PyTorch model using PyTorch FX (see details [here](https://pytorch.org/docs/stable/fx.html)), then running a few analysis and transformation passes on this graph until it's ready for emitting the Verilog code for the dataflow accelerator. First, we'll do this step-by-step, then we'll see how to automate all these operations with a single function call, using the `chop.pipelines` API.

In either case, we start by generating the computation graph through a process called symbolic tracing. As discussed in the `torch.fx` documentation, this involves running a forward pass of the model using dedicated `fx.Proxy` objects as the arguments, instead of the regular `torch.Tensor`s. These proxies record every operation executed on them, and this record is then used to generate the computation graph. Each node in the generated graph can be a single sublayer, such as `nn.Linear`, or a fine-grained function call such as `torch.matmul`. For the emit Verilog flow, we require the graph to be at layer granularity, meaning the internal function calls of each layer are hidden in the graph. To achieve this, we pass a `custom_ops` dictionary to the `MaseGraph` constructor, which instructs the FX tracer not to trace into the listed layers, so that each one appears as a single node in the graph. We also provide the desired implementation for the self attention layer, which is available in the [Mase Components](https://github.com/DeepWok/mase/tree/main/src/mase_components) library.

In [None]:
from chop.ir import MaseGraph
from mase_components import get_module_dependencies

BERT_CUSTOM_OPS = {
    "modules": {
        BertSelfAttentionInteger: {
            "args": {
                "hidden_states": "data_in",
                "attention_mask": None,
                "head_mask": None,
                "encoder_hidden_states": None,
                "encoder_attention_mask": None,
                "past_key_value": None,
                "output_attentions": "config",
            },
            "toolchain": "INTERNAL_RTL",
            "module": "fixed_self_attention_single_precision_wrapper",
            "dependence_files": get_module_dependencies(
                "attention/fixed_self_attention_single_precision_wrapper"
            ),
        },
    },
    "functions": {},
}

mg = MaseGraph(model, custom_ops=BERT_CUSTOM_OPS)
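
To make the idea of symbolic tracing more concrete, here is a toy sketch of a proxy object that records operations instead of computing them. It is written in plain Python and only mimics the concept; the class `ToyProxy` and the function `toy_trace` are hypothetical and far simpler than the real `torch.fx` tracer.

```python
class ToyProxy:
    """Stand-in value that logs each operation applied to it."""

    def __init__(self, name, graph):
        self.name = name
        self.graph = graph

    def _record(self, op, other):
        other_name = other.name if isinstance(other, ToyProxy) else repr(other)
        out = ToyProxy(f"v{len(self.graph)}", self.graph)
        # Each graph entry is (result, operation, left operand, right operand)
        self.graph.append((out.name, op, self.name, other_name))
        return out

    def __add__(self, other):
        return self._record("add", other)

    def __mul__(self, other):
        return self._record("mul", other)


def toy_trace(fn, arg_names):
    """Run fn on proxies and return the recorded graph and the output name."""
    graph = []
    args = [ToyProxy(name, graph) for name in arg_names]
    out = fn(*args)
    return graph, out.name


def toy_forward(x, w):
    return x * w + 1


graph, output = toy_trace(toy_forward, ["x", "w"])
for node in graph:
    print(node)
```

Running the function once on proxies is enough to recover the sequence of operations, which is the same principle that lets the tracer build a graph without real tensor data.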

Once the BERT model graph is generated, we can start with the analysis passes, which annotate the graph with relevant information, without changing the topology of the nodes and edges. The `add_common_metadata_analysis_pass` performs shape propagation, i.e. it runs a forward pass on the model to annotate each node with tensor metadata for each of the operator's input and output tensors. `add_hardware_metadata_analysis_pass` builds on top of this, annotating each node with the Verilog parameters which will later be used by the pass that emits the SystemVerilog code. One crucial aspect is the `max_parallelism` parameter, which corresponds to the number of arithmetic cores in each hardware submodule, affecting the resource consumption and latency of the resulting hardware. The `patch_metadata_transform_pass` annotates the fixed-point precision according to the quantization configuration for a subset of nodes which are relevant for the control flow of the generated hardware. For more information about each pass, see the [pass API documentation](https://deepwok.github.io/mase/modules/api/passes.html).

In [None]:
import torch
import chop.passes as passes

# Define some configuration parameters
CONFIG_BATCH_SIZE = 1
CONFIG_SEQUENCE_LENGTH = 4
MAX_PARALLELISM = 4
WAIT_COUNT = 15
WAIT_UNIT = "ms"


mg, _ = passes.init_metadata_analysis_pass(mg)

# * Add metadata analysis passes
mg, _ = passes.add_common_metadata_analysis_pass(
    mg,
    pass_args={
        "dummy_in": {
            "input_ids": torch.randn(
                (CONFIG_BATCH_SIZE, CONFIG_SEQUENCE_LENGTH, config.hidden_size)
            )
        },
        "add_value": False,
    },
)

mg, _ = passes.patch_metadata_transform_pass(
    mg,
    pass_args={
        "precision": "fixed",
        "q_config": q_config,
    },
)

mg, _ = passes.add_hardware_metadata_analysis_pass(
    mg,
    pass_args={
        "max_parallelism": [MAX_PARALLELISM] * 4,
    },
)
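
As a rough illustration of what shape propagation produces, the sketch below pushes an input shape through a chain of linear layers and records the input and output shape of each one. The layer names, the function `propagate_linear_shapes` and the metadata keys are hypothetical and chosen for this example; they do not reflect the metadata schema Mase actually uses.

```python
def propagate_linear_shapes(input_shape, layers):
    """Annotate each (name, in_features, out_features) layer with its shapes."""
    metadata = {}
    shape = tuple(input_shape)
    for name, in_features, out_features in layers:
        if shape[-1] != in_features:
            raise ValueError(
                f"{name}: expected last dimension {in_features}, got {shape[-1]}"
            )
        # A linear layer only changes the last dimension of its input
        out_shape = shape[:-1] + (out_features,)
        metadata[name] = {"data_in": shape, "data_out": out_shape}
        shape = out_shape
    return metadata


# Shapes matching the tutorial configuration: batch 1, sequence 4, hidden 96
layers = [("intermediate.dense", 96, 384), ("output.dense", 384, 96)]
meta = propagate_linear_shapes((1, 4, 96), layers)
for name, shapes in meta.items():
    print(name, shapes)
```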

At this stage, we are ready to execute the graph transformation passes, which use the annotated metadata to change the topology of the graph such that it is ready for Verilog emission. The `emit_verilog_top_transform_pass` generates the SystemVerilog top-level file, while `emit_internal_rtl_transform_pass` copies the relevant submodules from the [Mase Components](https://github.com/DeepWok/mase/tree/main/src/mase_components) SystemVerilog library to the user's work area. The `emit_bram_transform_pass` emits the BRAM modules which store the weights and biases on the FPGA for each layer in the model. A Cocotb testbench is generated by the `emit_cocotb_transform_pass`, which can be used for testing the generated hardware using real PyTorch datasets. Finally, `emit_vivado_project_transform_pass` prepares a Vivado project containing the emitted Verilog code, making it ready for synthesis and implementation on the FPGA board.

In [None]:
# Define the timeout time for the generated testbench
WAIT_COUNT = 15
WAIT_UNIT = "ms"

mg, _ = passes.emit_verilog_top_transform_pass(mg)
mg, _ = passes.emit_bram_transform_pass(mg)
mg, _ = passes.emit_internal_rtl_transform_pass(mg)
mg, _ = passes.emit_cocotb_transform_pass(
    mg,
    pass_args={
        "wait_time": WAIT_COUNT,
        "wait_unit": WAIT_UNIT,
    },
)
mg, _ = passes.emit_vivado_project_transform_pass(mg)

Hoorah!

## Emit SystemVerilog code for the accelerator: with automation

Now we've seen everything Mase does under the hood, but we don't want to write that much code each time we generate Verilog for a new model. Luckily, the workflow for every model is very similar, and can be abstracted into a pass manager, which runs a default set of passes. This is achieved through the AutoPipeline API.

In [None]:
from chop import AutoPipelineForEmitVerilog

# Redefine some configuration parameters
CONFIG_BATCH_SIZE = 1
CONFIG_SEQUENCE_LENGTH = 4
WAIT_COUNT = 15
WAIT_UNIT = "ms"
MAX_PARALLELISM = 4

mg = MaseGraph(model, custom_ops=BERT_CUSTOM_OPS)

pipeline = AutoPipelineForEmitVerilog()
mg = pipeline(
    mg,
    pass_args={
        "add_common_metadata_analysis_pass": {
            "dummy_in": {
                "input_ids": torch.randn(
                    (
                        CONFIG_BATCH_SIZE,
                        CONFIG_SEQUENCE_LENGTH,
                        config.hidden_size,
                    )
                )
            },
            "add_value": False,
        },
        "patch_metadata_transform_pass": {
            "q_config": q_config,
        },
        "add_hardware_metadata_analysis_pass": {
            "max_parallelism": [MAX_PARALLELISM] * 4,
        },
        "report_node_meta_param_analysis_pass": {
            "which": ["common", "hardware"],
            "save_path": "bert_graph_meta_params.txt",
        },
        "emit_cocotb_transform_pass": {
            "wait_time": WAIT_COUNT,
            "wait_unit": WAIT_UNIT,
        },
    },
)
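
The pattern behind such a pass manager can be sketched in a few lines: run a fixed list of passes in order and look up the arguments for each one by its name. The code below is a simplified illustration with toy passes operating on a dictionary; `run_pipeline` and both pass functions are hypothetical, and the real AutoPipeline classes may be organised differently.

```python
def annotate_shapes_pass(graph, pass_args=None):
    """Toy analysis pass: attach a shape to the graph."""
    shape = (pass_args or {}).get("shape")
    return {**graph, "shape": shape}, {"annotated": True}


def emit_report_pass(graph, pass_args=None):
    """Toy emit pass: produce a one-line report string."""
    prefix = (pass_args or {}).get("prefix", "")
    report = f"{prefix}{graph['name']}:{graph.get('shape')}"
    return graph, {"report": report}


def run_pipeline(graph, pass_list, pass_args=None):
    """Run passes in order, passing each one the arguments stored under its name."""
    pass_args = pass_args or {}
    outputs = {}
    for pass_fn in pass_list:
        args = pass_args.get(pass_fn.__name__, {})
        # Every pass returns the (possibly updated) graph and an info dictionary
        graph, info = pass_fn(graph, pass_args=args)
        outputs[pass_fn.__name__] = info
    return graph, outputs


toy_graph, toy_outputs = run_pipeline(
    {"name": "top"},
    [annotate_shapes_pass, emit_report_pass],
    pass_args={"annotate_shapes_pass": {"shape": (1, 4, 96)}},
)
print(toy_outputs["emit_report_pass"]["report"])
```

Keying the arguments by pass name means passes that need no configuration can simply be left out of the dictionary, which matches the way `pass_args` is written in the cell above.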

## Evaluate the generated accelerator

Now everything is ready, and the generated Verilog files can be found under `~/.mase/top/hardware/rtl`. You can inspect the `top.sv` file to see how data is propagated from the inputs of the module through every layer in the original model. You can also find the emitted Cocotb test under `~/.mase/top/hardware/test.py`. Note that the Cocotb testbench class is not emitted as a text file, but rather pickled and stored as a `.dill` file, which is a convenient way of sharing the testbench object. This is then unpickled and instantiated in the `test.py` file, which is executed by the Cocotb runner. Now, simply run the `simulate` action to obtain the latency for a single-batch inference pass.

In [None]:
import os
import chop.actions as actions

os.environ["COCOTB_RESOLVE_X"] = "ZEROS"
actions.simulate(
    skip_build=False, skip_test=False, gui=False, waves=False, simulator="questa"
)

## Conclusion

In this tutorial, we demonstrated the process of generating an FPGA accelerator for a quantized BERT model using the Mase framework. We began by loading a BERT model and defining its configuration and quantization parameters, then proceeded to quantize the model at the module level. Next, we walked through the detailed steps of emitting SystemVerilog code for the accelerator, which included generating a computation graph using PyTorch FX, performing various metadata analysis passes, and transforming the graph to be ready for Verilog emission. We showed how to automate these steps using the `chop.pipelines` API, greatly simplifying the workflow. Finally, we ran the generated Cocotb testbench to evaluate the performance of the accelerator, obtaining throughput and latency metrics.

By following this tutorial, you should now have a solid understanding of how to optimize transformer models for FPGA deployment using Mase, from quantization to hardware code generation and performance evaluation. If you are interested in experimenting further, we suggest the following exercises.

1. Re-run the flow by changing the `q_config` dictionary to try different fixed-point precisions. In each case, open the generated Vivado project and launch the synthesis flow to compare the resource consumption of the generated hardware. Create a plot of the LUT, FF and DSP utilization statistics for a range of fixed-point precisions.

2. Repeat exercise 1, but this time experiment with the maximum parallelism parameter. Again, compare the resource consumption for a range of parallelism parameters. This time, also run the Cocotb testbench in each iteration to see how the parallelism affects the inference latency. Based on this analysis, can you suggest an optimal design point that trades off resource consumption with inference latency?
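
One way to reason about the trade-off in exercise 2 is to keep only the design points that are not beaten on both resource usage and latency by some other point. The sketch below does this for a list of `(name, luts, latency)` tuples. The function `pareto_front` is a hypothetical helper, and the values are placeholders to be replaced with your own synthesis and simulation results.

```python
def pareto_front(points):
    """Return the names of points not dominated in (luts, latency), lower is better."""
    front = []
    for name, luts, latency in points:
        dominated = any(
            other_luts <= luts
            and other_latency <= latency
            and (other_luts < luts or other_latency < latency)
            for _, other_luts, other_latency in points
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder design points: (label, LUT count, latency in microseconds)
design_points = [
    ("p1", 100, 10.0),
    ("p2", 200, 5.0),
    ("p4", 400, 5.0),
    ("p8", 800, 2.0),
]
print(pareto_front(design_points))
```

In this placeholder data, `p4` uses more LUTs than `p2` without being faster, so it drops out, and the remaining points are the candidates worth comparing.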

If you are interested in contributing to the Mase project, we suggest the following extension task.

3. Try to support this flow for a new model, such as Llama, Mistral or GPT. Follow the steps in the [documentation]() to import a new model into Mase from the HuggingFace hub, and try running the `AutoPipelineForEmitVerilog`. If that doesn't work directly, see the hints in the [debugging guide]() to support the new model, now that you know the steps required.