# PowerGear: Early-Stage Power Estimation in FPGA HLS via Heterogeneous Edge-Centric GNNs

Zhe Lin<sup>1</sup>, Zike Yuan<sup>2†</sup>, Jieru Zhao<sup>3</sup>, Wei Zhang<sup>4</sup>, Hui Wang<sup>1</sup> and Yonghong Tian<sup>1,5\*</sup>

<sup>1</sup>Peng Cheng Laboratory, China <sup>2</sup>The University of Auckland, New Zealand <sup>3</sup>Shanghai Jiao Tong University, China <sup>4</sup>The Hong Kong University of Science and Technology, Hong Kong, China <sup>5</sup>Peking University, China {linzh01,wangh06,tianyh}@pcl.ac.cn, zyua138@aucklanduni.ac.nz, zhao-jieru@sjtu.edu.cn, wei.zhang@ust.hk

Abstract—Power estimation is the basis of many hardware optimization strategies. However, it is still challenging to offer accurate power estimation at an early stage such as highlevel synthesis (HLS). In this paper, we propose PowerGear, a graph-learning-assisted power estimation approach for FPGA HLS, which features high accuracy, efficiency and transferability. PowerGear comprises two main components: a graph construction flow and a customized graph neural network (GNN) model. Specifically, in the graph construction flow, we introduce buffer insertion, datapath merging, graph trimming and feature annotation techniques to transform HLS designs into graph-structured data, which encode both intra-operation micro-architectures and inter-operation interconnects annotated with switching activities. Furthermore, we propose a novel power-aware heterogeneous edge-centric GNN model which effectively learns heterogeneous edge semantics and structural properties of the constructed graphs via edge-centric neighborhood aggregation, and fits the formulation of dynamic power. Compared with on-board measurement, PowerGear estimates total and dynamic power for new HLS designs with errors of 3.60% and 8.81%, respectively, which outperforms the prior arts in research and the commercial product Vivado. In addition, PowerGear demonstrates a speedup of  $4\times$ over Vivado power estimator. Finally, we present a case study in which PowerGear is exploited to facilitate design space exploration for FPGA HLS, leading to a performance gain of up to 11.2%, compared with methods using state-of-the-art predictive models.

Index Terms—High-level synthesis, graph neural network, power estimation

#### I. INTRODUCTION

Power efficiency has emerged as one of the first-order constraints for hardware systems such as field-programmable gate arrays (FPGAs), and the design optimization with regard to power efficiency usually necessitates the knowledge of power consumption. However, the power evaluation flow for FPGA designs induces large overheads of design turnaround time. In general, accurate FPGA power estimation requires to obtain the signal activities of critical components and I/O ports via vector-based gate-level simulation, and a set of physical component measurements obtained through the register-transfer level (RTL)-based FPGA implementation flow, including synthesis, placement and routing, etc. With the low-level hardware details disclosed after conducting the above steps, analytical models [1] can be used to deduce signal activities for internal components and then infer power consumption. Therein, the cycleaccurate gate-level simulation and the FPGA implementation flow with NP-complete problems give rise to a large amount of runtime. Overall, in a power-oriented hardware optimization

loop, designers have to repeatedly perform the above power evaluation steps while refining the design architecture until power closure is achieved, which incurs long development time and high labor cost for every single design.

To speed up hardware power estimation, great efforts have been made. As for RTL, some works [2], [3] propose to collect a small set of signal activities through RTL simulation and then use them as input features to construct learningbased models for power inference, including decision trees, ensemble models, and convolutional neural networks. These works expedite the power estimation process by replacing the inefficient FPGA implementation flow with efficient modeling strategies. Some other works such as [4] seek to build models to predict gate-level internal signal activities using I/O and register activities from RTL simulation, and leverage the commercial power analysis tools to conduct power estimation after physical implementation, avoiding the tedious vector-based gatelevel simulation step. These works showcase gains in runtime efficiency by skipping either the RTL-based implementation flow or gate-level simulation, while the remaining steps are still time-consuming.

Some prior works [5]-[7] aim at achieving total power estimation at a higher abstraction level, i.e., high-level synthesis (HLS) [8], which greatly accelerates power estimation by dispensing with the low-level simulation and implementation steps in power inference. Some works [5], [6] extract activity features during C-level program execution, and either develop analytical power models for specific types of functions [5] or utilize machine learning models [6] to learn power characteristics of downstream FPGA implementation. These works are designspecific and the built models are not applicable to new designs of interest. A recent work HL-Pow [7] adopts histograms as a way of feature alignment over different designs, which allows power inference for unseen cases. This is accomplished by encoding the activities of each type of HLS operations into a histogram individually, concatenating histograms as overall design features, and then training models to infer power.

Nevertheless, these prior arts on HLS power estimation mainly carry out feature engineering on individual operations, neglecting the impact of interconnects between different operations and the switching activities associated with interconnects, even though interconnects have been proven to significantly contribute to power dissipation [9], especially dynamic power. In order to improve early-stage power estimation for FPGA HLS, we argue that it is vital to take into account the impact of interconnects in power modeling. To this end, we leverage

<sup>†</sup>Work done during Zike Yuan's internship at Peng Cheng Laboratory.

<sup>\*</sup>Corresponding author: Yonghong Tian (tianyh@pcl.ac.cn).

graph neural networks (GNNs) in an effort to jointly learn micro-architectures of HLS operations and interconnects with switching activities. Specifically, we present a graph construction flow to transform the HLS-based hardware designs into graph-structured data, wherein operations are cast as nodes with features indicative of micro-architectures, interconnects are projected as edges with relation types, and netlist activities are encoded as edge features. On this basis, we devise a novel heterogeneous edge-centric GNN model for power modeling, called *HEC-GNN*. Different from general-purpose GNNs that primarily work on node features, HEC-GNN is enhanced with the capability to exploit informative heterogeneous edge semantics and structural properties via the novel edge-centric aggregation scheme, and is aware of power via adaptively approximating the formation of dynamic power.

Putting it all together, we propose a graph-learning-assisted power modeling approach for FPGA HLS, named PowerGear, which distinguishes itself from the prior arts with the following key features: 1) accuracy: PowerGear demonstrates high accuracy for estimating dynamic power besides total power, which has not been jointly studied in prior related works [5]–[7] on HLS; 2) efficiency: compared with RTL-based works [2]-[4] and Vivado integrated power estimator [10], PowerGear gains considerable improvement in the turnaround time of power modeling; and 3) transferability: in contrast to the taskdependent modeling frameworks [2], [3], [5], [6], PowerGear can generalize to previously unseen designs without requiring model retraining. To the best of our knowledge, this is the first attempt to apply customized GNNs for early-stage power estimation in HLS with both total and dynamic power predicted accurately. In all, our major contributions are listed as follows:

- We introduce a graph construction flow to convert the FPGA HLS designs into graph data that preserve both intra-operation and inter-operation power-related features.
- We propose a novel power-aware GNN model, HEC-GNN, which fully mines rich heterogeneous edge semantics and structural properties via the edge-centric neighborhood aggregation, and subtly fits the dynamic power formulation.
- For the first time, we investigate dynamic power estimation for FPGA HLS in addition to total power, and demonstrate how our power estimation approach benefits powerefficient design space exploration in HLS.

#### II. PRELIMINARIES

The FPGA power consumption can be broken down into two components: static power and dynamic power consumption. The static power consumption mainly stems from the leakage currents passing through the transistors whenever the devices are powered up, independently of the workloads running in real time. As for earlier FPGA series, the static power is constant regardless of the implemented designs. However, recent FPGA products, such as Xilinx Ultrascale FPGA, have enabled automatic power gating [11] on unused hard blocks such as logic units and memories, making the static power dependent upon the design scale and utilized resource types.

The dynamic power consumption, on the other hand, is caused by signal switching activities, i.e., transistor switches



Fig. 1: PowerGear overview: in the training stage, graphstructured samples are constructed using HLS results, ground truth power values are collected from real measurement on board after RTL implementation, and a transferable GNN model is trained; in the inference stage, graph samples for new designs are generated after HLS runs and the trained GNN model is seamlessly employed for power prediction.

or register value changes that repeatedly charge and discharge interconnect capacitance. In contrast to static power, dynamic power reflects the runtime workloads driven by specific data rates and data characteristics. The overall dynamic power consumption is the sum of power consumption for each single netlist component triggering signal toggling, which is  $P_{dyn} = \sum_{i \in I} \alpha_i C_i V^2 f, \tag{1}$  where  $\alpha$  is the signal switching activity, C is the interconnect

$$P_{dyn} = \sum_{i \in I} \alpha_i C_i V^2 f, \tag{1}$$

capacitance, V is the supply voltage, f is the operating frequency and i is an interconnect of the whole set I. For a specific FPGA, the supply voltage V is fixed, and C for each i is decided by the FPGA placement and routing algorithms. Hence, given a target design on a specific FPGA, dynamic power is largely determined by runtime workloads that give rise to different signal switching activities and operating frequencies.

# III. POWERGEAR METHODOLOGY

The overview of our proposed approach, PowerGear, is shown in Fig. 1. To start with, we investigate an automated tool flow to capture the power-related information in the HLS designs and generate graph-structured data samples encapsulating both individual and contextual information of various types of design components. Therein, the graph construction is solely based on HLS results, i.e., the intermediate representation (IR) from HLS front-end compilation, and the finite state machine with datapath (FSMD) from HLS back-end optimization. The ground truth power value of each sample for model building is collected by real measurement on FPGA after performing the RTL implementation flow. In the training stage, we develop a novel directed, heterogeneous and edge-centric GNN model to effectively learn the principles behind FPGA power consumption, using the graph-structured features and the corresponding ground truth power values. In the power inference stage, new designs of interest only need to pass through the HLS flow, be converted into graph data in the same way as the training stage, and be fed into the trained GNN model to infer power directly, skipping the RTL implementation and model construction steps in the training stage. In the following parts of this section, we



Fig. 2: Graph construction flow with HLS DFG optimization. separately describe our proposed automated graph construction flow and GNN-assisted power modeling strategy.

# A. Graph Construction Flow

In the graph construction flow, we propose to generate graphstructured samples for FPGA HLS designs in order to make use of GNNs for power learning. HLS tool flows usually provide IR and FSMD information to recover dataflow graphs (DFGs) [12]. Indeed, the DFGs have already presented graph-like data: each DFG node is associated with an IR operation which defines its micro-architecture, and the interconnect between two operations forms a DFG edge. Directly using the HLS DFGs for power modeling, however, is problematic, because primitive DFGs tend to be large-scale but convey limited power-related details, harming both model efficiency and efficacy. To tackle this problem, we introduce four optimization strategies on the HLS DFGs to retrieve and retain from the DFG nodes the hardware components that have significant influence on power consumption, and augment the interconnects, i.e, DFG edges, with power information, which is depicted in Fig. 2.

Buffer insertion. In the DFGs, buffer components, including internal and I/O buffers, have not been declared explicitly. Nevertheless, memory activity is a critical source of power consumption. Hence, we seek to instrument buffers in the DFGs. We observe that memory elements can be inferred from DFG nodes with IR opcodes alloca and getelementptr followed by load or store. Specifically, we perform pattern matching of the above memory-related DFG nodes, insert buffer nodes in DFGs, connect buffer nodes to downstream nodes, and annotate buffer nodes with memory resource utilization.

**Datapath merging.** The DFGs may contain plenty of identical node chains from a source node to a sink node. For instance, loading data from a specific buffer and storing back to another one in different loop execution may cause different IR operations to be instantiated, even though these operations are highly similar. This leads to multiple duplicate DFG node chains generated between two DFG nodes, which deviates from the exact hardware realization. To restore the real hardware implementation from the DFGs, we perform datapath merging to fuse identical DFG node chains between two nodes. Furthermore, we try to account for resource sharing between different FSM stages in RTL, and to this end, we merge the DFG nodes utilizing the same set of hardware resources.

Graph trimming. The DFGs encompass paths that are not computationally intensive. We bypass DFG nodes that contribute little to arithmetic computation and also produce trivial

hardware entities, e.g., bit truncation and signed extension. This helps to improve the modeling efficiency by reducing the scale of the generated graph samples, and suppress noise by making the model focus more on crucial arithmetic-intensive datapaths.

**Feature annotation.** The DFGs do not reflect any signal toggling intrinsically. As the signal activity is the trigger of dynamic power consumption, we propose to integrate signal activities in interconnects, i.e., edges of graph samples for power learning. First, we annotate the signals across each edge of interest. That is, we identify the dataflow variables (operands or results of IR operations) transferred through the edges from the source nodes to the sink nodes. Then, we instrument detection probes to trace the values of variables of interest in the IR level. Next, we link together the testbenches with input stimuli, the instrumented IR and the function entities of detection probes following the method of [7]. Subsequently, we compile them into a single executable, and run the executable to acquire traced variable values in different execution cycles. Finally, we compute edge switching activities  $SA_{edge}$  by considering data injecting in and going out of edges individually. This gives

$$SA_{edge} = \left\{ \frac{\sum_{i=1}^{N_{\boldsymbol{v}}^{dir}} \operatorname{HD}\left(\boldsymbol{v}_{dir}(i), \boldsymbol{v}_{dir}(i-1)\right)}{L} \middle| dir \in \{src, snk\} \right\}, \quad (2)$$
 where  $\boldsymbol{v}$  is the set of bit vectors of variable values that pass

through the edge and change at different execution cycles, dir is the dataflow direction that separates v produced by source nodes (src) from v utilized by sink nodes (snk) of the edge,  $N_n^{dir}$  is the number of execution cycles that cause the change of v with dir, L is the latency of the design, and  $HD(\cdot)$  is the Hamming distance computation which counts the number of variable bits that are different in two execution cycles. Besides  $SA_{edge}$ , we also extract the activation rate for each edge, which is defined as

$$AR_{edge} = \left\{ \frac{N_{v}^{dir}}{L} \mid dir \in \{src, snk\} \right\}. \tag{3}$$

 $AR_{edge} = \left\{ \frac{N_{\boldsymbol{v}}^{dir}}{L} \;\middle|\; dir \in \{src, snk\} \right\}. \tag{3}$  Putting it all together, we construct four-dimensional edge features comprising the switching activities and the activation rates of both the source and sink variables through the edge. We further classify the DFG nodes into two categories: arithmetic (A) and non-arithmetic (N) nodes. Correspondingly, the edges are annotated with four types of source-to-sink relations:  $A \rightarrow N$ ,  $A \rightarrow A$ ,  $N \rightarrow A$ , and  $N \rightarrow N$ . As for nodes, we use the IR operation type, IR operation opcode, overall activation rate, input, output and overall switching activities as node features. The IR operation type and opcode are categorical features with one-hot encoding while the others are numeric features.

# B. Heterogeneous Edge-Centric GNN Model

The graphs generated by our proposed construction flow exhibit the following characteristics: 1) heterogeneity: the graph edges have different relation types, indicating whether the corresponding datapaths comprise arithmetic operations; 2) edge expressivity: signal switching activities, the major contributors to dynamic power as shown in Eq. 1, are intactly encoded in the edges, which demonstrates that the edge features have higher expressive capability than node features in terms of power consumption; 3) directionality: the edges are directed, meaning that data can only pass through an edge from the source node to the sink node, while the opposite is infeasible.

The mainstream GNN models [13], [14] mostly pay attention to node feature aggregation. Some recent works [15], [16] also support using edge features as the supplement to node features, whereas node features still play a dominant role in message passing. These existing node-centric GNN models, however, are ineffective in capturing heterogeneous edge semantics that are essential in the context of power estimation. In light of this, we develop a power-aware heterogeneous edge-centric GNN model, named HEC-GNN, which makes full use of edges in neighborhood aggregation to benefit power modeling.

**HEC-GNN** convolutional layer. Each graph sample built with our construction flow can be represented as G $(\mathcal{V}, \mathcal{E}, \mathcal{R})$ , where  $\mathcal{V}, \mathcal{E}$  and  $\mathcal{R}$  denote the set of graph nodes, edges and relation types, respectively. Formally, our proposed HEC-GNN collects information from neighbors and updates node embeddings at the k-th convolutional layer as

 $\boldsymbol{h}_{v}^{(k)} = \sigma \left( \boldsymbol{W}_{\mathcal{V}}^{(k)} \boldsymbol{h}_{v}^{(k-1)} + AGG^{(k)} \left( \left\{ \boldsymbol{e}_{u,v,r} | r \in \mathcal{R}, u \in \mathcal{N}_{v}^{r} \right\} \right) \right), \quad (4)$ where  $\boldsymbol{h}_{v}^{(k)}$  is the embedding vector of node  $v \in \mathcal{V}$ ,  $\boldsymbol{e}_{u,v,r}$  is the edge feature vector of the directed edge from node  $\boldsymbol{u}$  to node v with relation type  $r \in \mathcal{R}$  and  $(u, v, r) \in \mathcal{E}$ ,  $\mathcal{N}_v^r$  is the set of nodes with the successor being node v and the edge relation type being r,  $\boldsymbol{W}_{\mathcal{V}}^{(k)}$  is the learnable weight matrix for updating node embeddings from the last layer,  $\sigma$  is the ReLU activation function, and  $AGG^{(k)}$  represents the aggregation function in the k-th layer. HEC-GNN distinguishes itself from existing GNNs by mainly aggregating information from edge feature vectors. Moreover, HEC-GNN retains heterogeneity by modeling the interconnects between nodes with different relation types and separately gathering information of each relation type in the aggregation scheme.

Our HEC-GNN aggregation mechanism 
$$AGG^{(k)}$$
 is given as  $AGG^{(k)}(\{e_{u,v,r}|r\in\mathcal{R},u\in\mathcal{N}_v^r\}) = \sum_{r\in\mathcal{R}}\sum_{u\in\mathcal{N}_v^r} \boldsymbol{W}_r^{(k)}\boldsymbol{W}_{\mathcal{E}}^{(k)}\boldsymbol{e}_{u,v,r},$  (5)

where  $\boldsymbol{W}_{\mathcal{E}}^{(k)}$  and  $\boldsymbol{W}_{r}^{(k)}$  are learnable weight matrices for all edges and the relation type  $r \in \mathcal{R}$ , respectively. Recall that in Eq. 1, the dynamic power can be viewed as weighted aggregation of interconnect activity  $\alpha_i$  with weight being  $C_iV^2f$ . The heuristic behind Eq. 5 is to simulate the formation of dynamic power consumption via the HEC-GNN aggregation and update process. Note that the activity  $\alpha_i$  has already been encoded in the edge feature vector  $e_{u,v,r}$  with interconnect i = (u,v,r), our goal is to adaptively fit the weight term  $C_iV^2f$  via learnable weight matrices. To achieve this, we first use a global weight matrix  $W_{\mathcal{E}}^{(k)}$  to learn common knowledge from all types of edges, which corresponds to the equation term  $V^2 f$ . Next, we simplify the intricate interconnects using multiple relation types. This enables us to approximate the interconnect capacitance  $C_i$  using relation-specific interconnect capacitance capacitance  $C_i$  using relation-specific interconnect capacitance  $C_r$ , which is produced by the weight matrix  $\boldsymbol{W}_r^{(k)}$ . Putting it all together, the edge feature vector  $\boldsymbol{e}_{u,v,r}$  reflects the activity term  $\alpha_i$ , the global weight matrix  $\boldsymbol{W}_{\mathcal{E}}^{(k)}$  accounts for  $V^2f$ , while the relation-specific weight matrices  $\boldsymbol{W}_r^{(k)}$ ,  $\forall r \in \mathcal{R}$ , match with the interconnect capacitance  $C_i$ . Hence, the HEC-GNN aggregation mechanism perceives dynamic power consumption by subtly fitting the dynamic power formula.



Fig. 3: Overall architecture of HEC-GNN for power learning.

**HEC-GNN overall architecture**. Fig. 3 depicts the overall architecture of our HEC-GNN model for power learning. First, the graph data are fed into multiple HEC-GNN convolutional layers to learn and produce node embeddings. After that, the graph-level embedding  $m{h}_G$  is obtained via sum pooling:  $m{h}_G = \sum_{k \in \mathcal{K}} \sum_{v \in \mathcal{V}} m{h}_v^{(k)},$ 

$$\boldsymbol{h}_G = \sum_{v \in \mathcal{V}} \sum_{v \in \mathcal{V}} \boldsymbol{h}_v^{(k)}, \tag{6}$$

where  $\mathcal K$  is the set of indexes of graph convolutional layers. Instead of simply pooling the node embeddings from the last convolutional layer, we sum node embeddings obtained from different graph convolutional layers, which can be viewed as a form of skip connection to enhance generalization ability.

In addition to leveraging the graph embeddings that encapsulate localized information mainly contributing to dynamic power, we also exploit some global metedata from HLS reports that are complementary to the graph information and are indicative of static power. These metadata include the global resource utilization (LUT, DSP and BRAM), timing information (latency and achieved clock period) in HLS, and the scaling factors, i.e., the ratio of the above design metrics over those of the unoptimized baseline. As shown in Fig. 3, we use a multi-layer perceptron (MLP) with one fully connected layer followed by ReLU activation to embed the above global metedata. Thereafter, the two sets of embeddings, i.e., the graph embedding  $h_G$  and the global metadata embedding  $h_M$ , are concatenated to form a holistic embedding. Finally, the holistic embedding is fed into another MLP with two fully connected layers and ReLU activation in between. The output of this MLP is the total or dynamic power estimation  $P_{est.}$ :

$$P_{est.} = \text{MLP}(\boldsymbol{h}_G \mid\mid \boldsymbol{h}_M). \tag{7}$$

The above model components are cascaded to constitute our end-to-end supervised model, HEC-GNN, which is trained via regression to minimize the mean average percentage error loss. Furthermore, we adopt an ensemble learning strategy, in which we perform 10-fold cross-validation together with three different random seeds to generate different training and validation sets for model generation, and average all the output of trained models to get the final prediction results. Overall, HEC-GNN is heterogeneous, edge-centric and directed.

# IV. EXPERIMENTAL RESULTS

The graph construction flow of PowerGear is implemented with Python for HLS data extraction and graph generation, and C++ for IR modification and detection probe implementation. The proposed HEC-GNN model and the baseline GNNs [13]-[16] are developed with PyTorch Geometric [17] and Scikitlearn [18] toolkits. We evaluate our approach using Poly-

TABLE I: Dataset properties, results of total and dynamic power estimation, and runtime speedup over Vivado power estimator.

| Dataset | Dataset  | Properties  | Erro   | r of Total Po | ower (%)  | Error of Dynamic Power (%) |           |           |       |        |           | Runtime |
|---------|----------|-------------|--------|---------------|-----------|----------------------------|-----------|-----------|-------|--------|-----------|---------|
| Dataset | #Samples | Avg. #Nodes | Vivado | HL-Pow        | PowerGear | GCN                        | GraphSage | GraphConv | GINE  | HL-Pow | PowerGear | Speedup |
| Atax    | 521      | 141         | 15.04  | 5.99          | 6.15      | 13.96                      | 14.59     | 11.93     | 15.45 | 14.92  | 11.18     | 4.08×   |
| Bicg    | 489      | 137         | 13.61  | 4.42          | 4.97      | 11.96                      | 11.35     | 11.06     | 10.44 | 12.21  | 9.65      | 3.96×   |
| Gemm    | 531      | 341         | 25.59  | 3.22          | 2.75      | 11.02                      | 9.66      | 9.53      | 8.92  | 10.35  | 8.32      | 2.09×   |
| Gesummv | 529      | 221         | 28.27  | 2.74          | 2.67      | 16.59                      | 14.33     | 12.14     | 11.89 | 13.52  | 9.35      | 10.81×  |
| 2mm     | 483      | 443         | 24.34  | 3.52          | 4.47      | 8.51                       | 8.87      | 10.32     | 8.93  | 9.14   | 6.81      | 1.99×   |
| 3mm     | 483      | 447         | 16.93  | 2.19          | 2.63      | 12.55                      | 12.06     | 11.55     | 10.92 | 12.60  | 8.62      | 2.31×   |
| Mvt     | 531      | 165         | 19.03  | 2.98          | 2.77      | 13.94                      | 12.09     | 9.94      | 10.04 | 12.38  | 8.77      | 7.69×   |
| Syrk    | 530      | 320         | 26.24  | 5.37          | 3.76      | 14.53                      | 14.51     | 13.83     | 11.70 | 14.50  | 8.64      | 1.47×   |
| Syr2k   | 483      | 444         | 27.36  | 3.70          | 2.26      | 13.42                      | 9.75      | 8.82      | 12.24 | 14.45  | 7.98      | 2.13×   |
| Average | 509      | 295         | 21.82  | 3.79          | 3.60      | 12.94                      | 11.91     | 11.01     | 11.17 | 12.67  | 8.81      | 4.06×   |

bench [19] datasets with different dataset properties shown in Table I. HLS design samples are generated by applying loop pipelining, loop unrolling and buffer partitioning. We also include some synthetic datasets to increase the diversity of loop patterns in training. The hardware designs are developed with Vivado and VivadoHLS 2018.2 and implemented on Xilinx Ultrascale+ ZCU102 FPGA board with the frequency of 100 MHz. Ground truth power values are obtained by power measurement with Power Advantage Tool [20]. Software design flows run on 80-core Intel Xeon CPU at 2.4 GHz with one Nvidia Tesla V100 GPU.

Regarding the hyperparameters of GNNs, we employ three layers of graph convolution with a hidden dimension of 128, a batch size of 128, a dropout rate of 0.2 and a learning rate of 0.0005. We train the GNN models with 1200 and 2400 epochs for total and dynamic power estimation, respectively. 20% of the training data are used as the validation set.

# A. Estimation Accuracy and Runtime Speedup

In this experiment, we evaluate both total and dynamic power estimation accuracy and runtime speedup for HLS-based FPGA designs using PowerGear. We leave one target application out of the nine applications as the test dataset, and use all the others for training. With this leave-one-out training scheme, we can verify the transferability of the models on all nine datasets. We compare PowerGear with the Vivado power estimator [10], the state-of-the-art work HL-Pow [7], and some mainstream GNN models [13]–[16]. As for Vivado power estimation, we import the gate-level netlist after physical implementation and provide .saif activity files via vector-based simulation, which ensures a high confidence level of estimation precision. Moreover, we observe through experiments that the Vivado power estimator neglects the impact of power gating [11] on unused hard blocks, leading to a severe deviation from real power consumption. Hence, we further calibrate the results with a linear regression model. As for HL-Pow [7], we follow its design flow and implement the gradient boosting decision tree (GBDT) models. Similar to GNNs, we use 20% of training data for validation, based on which we tune the hyperparameters with tree size in [10, 500], tree depth in [5, 10], minimum samples per leaf in [2, 8], and learning rate in  $\{0.005, 0.01, 0.05\}$ .

As shown in Table I, regarding total power estimation, Vivado power estimation diverges considerably from the real measurement, leading to an average error of 21.82%, whereas the HL-Pow and our proposed PowerGear achieve good prediction accuracy with average errors of 3.79% and 3.60%, respectively. In general, PowerGear achieves the best average accuracy for

TABLE II: Error (%) of dynamic power estimation using different HEC-GNN variants.

| Dataset | $\ $ | W/o opt. |   | W/o e.f. |   | W/o dir. |   | W/o hetr. |   | W/o md. |   | Sgl.  |     | Prop. |
|---------|------|----------|---|----------|---|----------|---|-----------|---|---------|---|-------|-----|-------|
| Atax    | П    | 13.32    | I | 11.40    | 1 | 11.84    | Ī | 11.39     | I | 12.37   | Τ | 11.68 | П   | 11.18 |
| Bicg    | II   | 10.63    |   | 10.56    | ı | 10.19    | ١ | 9.98      | İ | 11.08   |   | 10.06 | ij  | 9.65  |
| Gemm    | Ш    | 10.04    |   | 10.39    | 1 | 8.53     |   | 9.48      |   | 8.66    |   | 8.79  |     | 8.32  |
| Gesummv |      | 13.83    |   | 11.14    |   | 9.99     |   | 10.31     |   | 9.74    |   | 9.68  |     | 9.35  |
| 2mm     | Ш    | 10.25    |   | 8.67     | ı | 6.97     |   | 7.64      | İ | 7.47    |   | 6.93  | ii. | 6.81  |
| 3mm     | II   | 11.77    |   | 9.62     | ı | 9.29     | ı | 9.15      | İ | 11.93   |   | 8.96  | ij  | 8.62  |
| Mvt     | II   | 12.54    |   | 9.40     | İ | 8.70     | ı | 9.71      | İ | 8.81    |   | 8.47  | ij  | 8.77  |
| Syrk    | ij   | 12.64    |   | 10.92    | İ | 8.95     | ı | 9.51      | İ | 10.75   |   | 9.04  | ii. | 8.64  |
| Syr2k   |      | 9.62     |   | 9.66     | İ | 8.51     |   | 8.97      |   | 7.16    |   | 8.10  | II  | 7.98  |
| Average |      | 11.74    |   | 10.20    | 1 | 9.22     | I | 9.57      |   | 9.77    |   | 9.08  |     | 8.81  |

total power estimation. Regarding dynamic power estimation, we compare PowerGear with HL-Pow and commonly used GNN models [13]-[16]. Overall, PowerGear outperforms all baseline methods on all the evaluated datasets, significantly reducing the estimation error of dynamic power from 12.67% down to 8.81% compared with HL-Pow. Results verify that PowerGear's graph construction flow successfully retains interconnect and activity features that have important implication for power consumption, and moreover, the HEC-GNN model proposed in PowerGear effectively mines rich heterogeneous edge semantics and structural properties via the edge-centric neighborhood aggregation mechanism. In comparison to the GNN baselines, including GCN [13] and GraphSage [14] that simply focus on node features, and GraphConv [16] and GINE [15] which use both node and edge features, PowerGear still reaps notable accuracy improvement. Our insight is that PowerGear intelligently fits the dynamic power formulation in HEC-GNN aggregation and additionally accounts for overall power characteristics using metadata embeddings, leading to high efficacy and generalization ability for power modeling.

With PowerGear, we can realize efficient FPGA power estimation by simply executing the HLS flow and skipping the tedious cycle-accurate simulation, low-level synthesis and physical implementation steps. This gives rise to the runtime speedup of  $1.47{-}10.81{\times}~(4.06{\times}~on~average)$  over the Vivado power estimation process.

# B. Ablation Study

To gain better understanding of various optimization strategies adopted by HEC-GNN in PowerGear, we conduct an ablation study which evaluates the accuracy of dynamic power estimation using six variants of HEC-GNN in addition to the proposed HEC-GNN denoted as *prop*. for short. These models include: 1) *w/o opt.*: single unoptimized HEC-GNN without considering edge features, directionality, heterogeneity, and global metadata embeddings; 2) *w/o e.f.*: single HEC-GNN without using edge features; 3) *w/o dir.*: single HEC-GNN



Fig. 4: Pareto frontiers of Atax and Mvt with PowerGear. TABLE III: ADRS of HLS-based design space exploration.

| Sampling<br>Budget |        | Prediction M<br>HL-Pow | Iodel<br>  PowerGear | PowerGear Gains<br>v.s. Vivado   v.s. HL-Po |       |  |  |
|--------------------|--------|------------------------|----------------------|---------------------------------------------|-------|--|--|
| 20%                | 0.1657 | 0.1050                 | 0.0981               | 39.2%                                       | 6.9%  |  |  |
| 30%                | 0.1520 | 0.0841                 | 0.0774               | 45.8%                                       | 10.4% |  |  |
| 40%                | 0.1423 | 0.0691                 | 0.0626               | 52.0%                                       | 11.2% |  |  |

without directionality; 4) w/o hetr.: single HEC-GNN without using multiple relation types; 5) w/o md.: single HEC-GNN without using the global metadata embeddings; and 6) sgl.: single optimized HEC-GNN without considering ensemble.

As shown in Table II, our proposed HEC-GNN model is consistently superior to its six variants in seven out of nine datasets for dynamic power estimation with FPGA HLS, and finally yields the best accuracy on average. It can be inferred from the results that the adoption of edge features, directionality, heterogeneity, global metadata embeddings and ensemble all contribute to the enhancement of model accuracy, which explains the effectiveness of HEC-GNN that jointly makes use of all these optimization strategies.

#### C. Case Study

We demonstrate a case study in which PowerGear is exploited as the prediction model to guide fast HLS-based hardware design space exploration (DSE) to trade off between latency and dynamic power. Given a dataset for DSE, we first sample a small subset of design points for HLS and then utilize PowerGear to estimate dynamic power. Together with the set of latency derived from HLS, we compute the dynamic power-latency Pareto frontier using existing sampling points, based on which a sampling algorithm [7] is applied to select promising design points that are most likely to be Paretooptimal for further evaluation. The above steps are conducted iteratively to calibrate the approximate Pareto frontier until the total sampling budget is met. Experiments are performed with an initial sampling budget of 2% and total sampling budgets of 20%, 30% and 40%, respectively. Vivado power estimator and HL-Pow are used as alternative power prediction models for comparison. A commonly used evaluation metric average distance from reference set (ADRS) is employed to quantify the difference between the approximate and exact Pareto frontiers:

ween the approximate and exact Pareto frontiers 
$$ADRS(\Gamma, \Omega) = \frac{1}{|\Gamma|} \sum_{\gamma \in \Gamma} \min_{\omega \in \Omega} f(\gamma, \omega), \tag{8}$$

where  $\Omega$  is the approximate Pareto-optimal set,  $\Gamma$  is the exact Pareto-optimal set, and  $f(\cdot)$  computes the distance between two design points  $\omega \in \Omega$  and  $\gamma \in \Gamma$ . A lower ADRS means that the approximate Pareto set is closer to the exact one.

Fig. 4 shows the DSE results of two datasets under a total sampling budget of 40%, which indicate that PowerGear can

effectively aid in DSE algorithms to realize close approximation of Pareto-optimal sets with moderate runtime overhead. Moreover, as shown in Table III, the ADRS of Pareto search using PowerGear as the prediction model is the best for all three sampling budgets. Compared with the methods using Vivado and HL-Pow, PowerGear-assisted DSE yields relative performance gains of 39.2–52% and 6.9–11.2%, respectively. Through this case study, we show that PowerGear's capability to provide accurate dynamic power estimation in FPGA HLS opens up more opportunities for optimizing dynamic power.

#### V. CONCLUSION

We propose PowerGear, an early-stage power estimation approach for FPGA HLS. PowerGear constructs graph data from HLS in a way that preserves operations, interconnects and switching activities, and moreover, PowerGear subtly exploits edge features for power learning via HEC-GNN. Experiments prove that PowerGear is accurate, efficient and transferable, and can notably boost the quality of DSE in FPGA HLS.

#### ACKNOWLEDGMENT

This research was supported by Key-Area Research and Development Program of Guangdong Province under grant No. 2019B010155002.

#### REFERENCES

- D. Liu and C. Svensson, "Power consumption estimation in CMOS VLSI chips," *IEEE Journal of Solid-State Circuits*, 1994.
- [2] Z. Lin, S. Sinha, and W. Zhang, "An ensemble learning approach for in-situ monitoring of FPGA dynamic power," TCAD, 2018.
- [3] Y. Zhou, H. Ren, Y. Zhang, B. Keller, Z. Zhang *et al.*, "PRIMAL: Power inference using machine learning," in *Proc. of DAC*, 2019.
- [4] Y. Zhang, H. Ren, and B. Khailany, "GRANNITE: Graph neural network inference for transferable power estimation," in *Proc. of DAC*, 2020.
- [5] W. Zuo, W. Kemmerer, J. B. Lim, L.-N. Pouchet et al., "A polyhedral-based SystemC modeling and generation framework for effective low-power design space exploration," in *Proc. of ICCAD*, 2015.
- [6] D. Lee, L. K. John, and A. Gerstlauer, "Dynamic power and performance back-annotation for fast and accurate functional hardware simulation," in *Proc. of DATE*, 2015.
- [7] Z. Lin, J. Zhao et al., "HL-Pow: A learning-based power modeling framework for high-level synthesis," in Proc. of ASP-DAC, 2020.
- [8] P. Coussy, D. D. Gajski, M. Meredith et al., "An introduction to high-level synthesis," *IEEE Design Test of Computers*, 2009.
- [9] L. Zhong and N. K. Jha, "Interconnect-aware high-level synthesis for low power," in *Proc. of ICCAD*, 2002.
- [10] Xilinx Ltd, "Vivado design suite tutorial: Power Analysis and Optimization," Xilinx White Paper, October 2019.
- [11] S. Kolluri, "UltraScale Architecture Low Power Technology Overview," Xilinx White Paper, October 2015.
- [12] E. Ustun, C. Deng, D. Pal et al., "Accurate operation delay prediction for fpga hls using graph neural networks," in Proc. of ICCAD, 2020.
- [13] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in *Proc. of ICLR*, 2017.
- [14] W. L. Hamilton, R. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in *Proc. of NeurIPS*, 2017.
- [15] W. Hu, B. Liu, J. Gomes, M. Zitnik, P. Liang et al., "Strategies for pretraining graph neural networks," in Proc. of ICLR, 2020.
- [16] C. Morris, M. Ritzert, M. Fey et al., "Weisfeiler and leman go neural: Higher-order graph neural networks," in Proc. of AAAI, 2019.
- [17] M. Fey and J. E. Lenssen, "Fast graph representation learning with PyTorch Geometric," in *ICLR Workshop*, 2019.
- [18] F. Pedregosa, G. Varoquaux et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, 2011.
- [19] L.-N. Pouchet. (2012) Polybench: The polyhedral benchmark suite. [On-line]. Available: http://www.cs.ucla.edu/%7Epouchet/software/polybench
- [20] Xilinx Ltd, "Zynq UltraScale+ MPSoC Power Advantage Tool 2018.1," Xilinx Wiki, 2018.